Abstract
EVA is an efficient reinforcement learning framework for video understanding that enables adaptive reasoning through an iterative summary-plan-action-reflection loop, outperforming existing methods on multiple video benchmarks.
Video understanding with multimodal large language models (MLLMs) remains challenging due to the long token sequences of videos, which contain extensive temporal dependencies and redundant frames. Existing approaches typically treat MLLMs as passive recognizers, processing entire videos or uniformly sampled frames without adaptive reasoning. Recent agent-based methods introduce external tools, yet still depend on manually designed workflows and perception-first strategies, resulting in inefficiency on long videos. We present EVA, an Efficient Reinforcement Learning framework for End-to-End Video Agent, which enables planning-before-perception through iterative summary-plan-action-reflection reasoning. EVA autonomously decides what to watch, when to watch, and how to watch, achieving query-driven and efficient video understanding. To train such agents, we design a simple yet effective three-stage learning pipeline, comprising supervised fine-tuning (SFT), Kahneman-Tversky Optimization (KTO), and Group Relative Policy Optimization (GRPO), that bridges supervised imitation and reinforcement learning. We further construct high-quality datasets for each stage, supporting stable and reproducible training. We evaluate EVA on six video understanding benchmarks, demonstrating its comprehensive capabilities. Compared with existing baselines, EVA achieves a substantial improvement of 6-12% over general MLLM baselines and a further 1-3% gain over prior adaptive agent methods. Our code and model are available at https://github.com/wangruohui/EfficientVideoAgent.
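The summary-plan-action-reflection loop under a frame budget can be sketched as below. All function names, the uniform-probing planner, and the two-round reflection heuristic are illustrative assumptions, not EVA's actual implementation; in the paper the planning, summarization, and reflection steps are carried out by the MLLM itself.

```python
# Hypothetical sketch of a planning-before-perception agent loop.
# Toy stand-ins below; a real agent would use the MLLM for each step.

def plan_frames(query, summary, video_len, seen, k):
    # Toy planner: uniformly probe k unseen frame indices
    # (a real planner conditions frame selection on the query).
    unseen = [i for i in range(video_len) if i not in seen]
    stride = max(1, len(unseen) // k)
    return unseen[::stride][:k]

def perceive(idx):
    return f"frame-{idx}"  # stand-in for visual features of frame idx

def summarize(observations):
    return observations    # stand-in for a textual summary of new evidence

def reflect_is_sufficient(query, summary):
    # Toy reflection: stop once two rounds of evidence are collected.
    return len(summary) >= 2

def answer(query, summary):
    n_frames = sum(len(s) for s in summary)
    return f"answer to {query!r} from {n_frames} frames"

def run_agent(query, video_len, frame_budget=32, max_steps=4):
    """Iterate summary-plan-action-reflection under a visual budget."""
    summary, seen = [], set()
    per_step = max(1, frame_budget // max_steps)
    for _ in range(max_steps):
        remaining = frame_budget - len(seen)
        if remaining <= 0:
            break                                  # budget exhausted
        plan = plan_frames(query, summary, video_len, seen,
                           min(per_step, remaining))   # plan
        observations = [perceive(i) for i in plan]     # action
        seen.update(plan)
        summary.append(summarize(observations))        # summary update
        if reflect_is_sufficient(query, summary):      # reflection
            break
    return answer(query, summary), len(seen)
```

For example, `run_agent("what color is the car?", video_len=1000)` inspects only a small, budget-bounded subset of the 1000 frames before answering, which is the query-driven efficiency the abstract describes.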
Community
The planning-before-perception loop EVA uses to allocate a visual budget and iterate through summary-plan-action-reflection is the most interesting part here. My main question: how much of the gains come from the reflection stage itself? An ablation that removes reflection and keeps only summary-plan-action would show whether reflection is essential; if it adds little, dropping it would simplify training and make end-to-end adaptivity even more practical for long videos. The walkthrough on arxivlens helped me parse the method details: https://arxivlens.com/PaperView/Details/eva-efficient-reinforcement-learning-for-end-to-end-video-agent-4899-5b4fbeb9. I'd also be curious how this scales to extremely long, densely labeled videos; could a fixed budget per segment match the gains in practice?
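On the training side, the GRPO stage mentioned in the abstract scores a group of rollouts for the same query and normalizes their rewards within the group, avoiding a learned value model. A minimal sketch of that group-relative advantage computation, following the commonly published GRPO formulation (the reward values here are illustrative and this is not necessarily EVA's exact recipe):

```python
# Group-relative advantage normalization as used in GRPO-style training.
# Each rollout's reward is centered and scaled by the group statistics:
#   A_i = (r_i - mean(r)) / (std(r) + eps)

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same query."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

With binary rewards such as `[1.0, 0.0, 0.0, 1.0]` (two correct, two incorrect rollouts), correct rollouts receive positive advantages and incorrect ones negative, and the advantages sum to zero within the group.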
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- VideoBrain: Learning Adaptive Frame Sampling for Long Video Understanding (2026)
- Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning (2026)
- Weaver: End-to-End Agentic System Training for Video Interleaved Reasoning (2026)
- LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding (2026)
- VideoTemp-o3: Harmonizing Temporal Grounding and Video Understanding in Agentic Thinking-with-Videos (2026)
- VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking (2026)
- Towards Sparse Video Understanding and Reasoning (2026)