Do you have any related native multimodal architecture diagrams?

#4
by jackkuo - opened

Great work, especially since you mentioned it's natively multimodal. Do you have any related native multimodal architecture diagrams? Or is it following the same logic as Kimi-VL?

Moonshot AI org

It is an upgraded version compared to Kimi-VL, especially featuring video understanding. We will release more details later.

Hello. I noticed in the paper that Kimi-K2.5 employs the same vision encoder as Kimi-VL for video processing. As Kimi-VL does not process audio, I am curious about the API's capabilities. Does the Kimi-K2.5 API use video transcription (like Kimi ASR), or does the model served via the API have a built-in audio encoder?
