Small Vision-Language Models are Smart Compressors for Long Video Understanding Paper • 2604.08120 • Published 7 days ago • 20
MolmoWeb: Open Visual Web Agent and Open Data for the Open Web Paper • 2604.08516 • Published 7 days ago • 41
When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models Paper • 2604.08546 • Published 7 days ago • 114
SkillClaw: Let Skills Evolve Collectively with Agentic Evolver Paper • 2604.08377 • Published 7 days ago • 276
ClawBench: Can AI Agents Complete Everyday Online Tasks? Paper • 2604.08523 • Published 7 days ago • 255
ELT: Elastic Looped Transformers for Visual Generation Paper • 2604.09168 • Published 6 days ago • 19
WildDet3D: Scaling Promptable 3D Detection in the Wild Paper • 2604.08626 • Published 7 days ago • 232
Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music Paper • 2604.10905 • Published 3 days ago • 24
Strips as Tokens: Artist Mesh Generation with Native UV Segmentation Paper • 2604.09132 • Published 6 days ago • 49
The Ultra-Scale Playbook 🌌: The ultimate guide to training LLMs on large GPU clusters • Running • 3.78k