VimRAG: Navigating Massive Visual Context in Retrieval-Augmented Generation via Multimodal Memory Graph Paper • 2602.12735 • Published Feb 13 • 8
WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM Paper • 2509.21990 • Published Sep 26, 2025 • 1
A Simple Baseline for Streaming Video Understanding Paper • 2604.02317 • Published 19 days ago • 72
LongCat-Next: Lexicalizing Modalities as Discrete Tokens Paper • 2603.27538 • Published 23 days ago • 144
PEARL: Personalized Streaming Video Understanding Model Paper • 2603.20422 • Published Mar 20 • 40
Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled Image-Text-to-Text • 28B • Updated 15 days ago • 567k • 2.74k
Qwen3.5 Omni Online Demo 📚 Running • Agents • 64 • Chat with a multimodal AI using text, image, audio, or video
NLE: Non-autoregressive LLM-based ASR by Transcript Editing Paper • 2603.08397 • Published Mar 9 • 22
MAISI-v2: Accelerated 3D High-Resolution Medical Image Synthesis with Rectified Flow and Region-specific Contrastive Loss Paper • 2508.05772 • Published Aug 7, 2025 • 3