Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration Paper • 2309.01131 • Published Sep 3, 2023 • 1
Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models Paper • 2402.19014 • Published Feb 29, 2024
Talk With Human-like Agents: Empathetic Dialogue Through Perceptible Acoustic Reception and Reaction Paper • 2406.12707 • Published Jun 18, 2024
Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy Paper • 2502.05177 • Published Feb 7, 2025 • 2
VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model Paper • 2505.03739 • Published May 6, 2025 • 9
VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation Paper • 2510.09607 • Published Oct 10, 2025 • 2
VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting Paper • 2510.21817 • Published Oct 21, 2025 • 42
Input Domain Aware MoE: Decoupling Routing Decisions from Task Optimization in Mixture of Experts Paper • 2510.16448 • Published Oct 18, 2025
TACO: Think-Answer Consistency for Optimized Long-Chain Reasoning and Efficient Data Learning via Reinforcement Learning in LVLMs Paper • 2505.20777 • Published May 27, 2025
Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision Paper • 2601.19798 • Published 3 days ago • 38
BASIC: Boosting Visual Alignment with Intrinsic Refined Embeddings in Multimodal Large Language Models Paper • 2508.06895 • Published Aug 9, 2025
Youtu-Parsing: Perception, Structuring and Recognition via High-Parallelism Decoding Paper • 2601.20430 • Published 2 days ago • 14