Do you have any related native multimodal architecture diagrams?

#4
by jackkuo - opened

Great work, especially since you mentioned it's natively multimodal. Do you have any related native multimodal architecture diagrams? Or is it following the same logic as Kimi-VL?

Moonshot AI org

It is an upgraded version compared to Kimi-VL, especially featuring video understanding. We will release more details later.

Hello. I noticed in the paper that Kimi-K2.5 employs the same vision encoder as Kimi-VL for video processing. As Kimi-VL does not process audio, I am curious about the API's capabilities. Does the Kimi-K2.5 API use video transcription (like Kimi ASR), or does the model served via the API have a built-in audio encoder?
