Efficient Autoregressive Video Diffusion with Dummy Head Paper • 2601.20499 • Published 13 days ago • 7
Quantifying the Gap between Understanding and Generation within Unified Multimodal Models Paper • 2602.02140 • Published 8 days ago • 12
SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild? Paper • 2602.03916 • Published 7 days ago • 11
Horizon-LM: A RAM-Centric Architecture for LLM Training Paper • 2602.04816 • Published 6 days ago • 16
Idea2Story: An Automated Pipeline for Transforming Research Concepts into Complete Scientific Narratives Paper • 2601.20833 • Published 13 days ago • 174
Executable Code Actions Elicit Better LLM Agents Paper • 2402.01030 • Published Feb 1, 2024 • 187
QuantiPhy: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models Paper • 2512.19526 • Published Dec 22, 2025 • 12
The Ultra-Scale Playbook 🌌 Space • Running • The ultimate guide to training LLM on large GPU Clusters • 3.67k
Cosmos-Tokenizer Collection A suite of image and video tokenizers • 13 items • Updated 5 days ago • 43
VGGHeads: A Large-Scale Synthetic Dataset for 3D Human Heads Paper • 2407.18245 • Published Jul 25, 2024 • 12
dima806/facial_emotions_image_detection Image Classification • 85.8M • Updated Oct 19, 2024 • 55.7k • 117
SEED-Story: Multimodal Long Story Generation with Large Language Model Paper • 2407.08683 • Published Jul 11, 2024 • 24