MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome
Abstract
MiroEval addresses limitations of existing deep research system benchmarks with a comprehensive evaluation framework that combines adaptive synthesis assessment, agentic factuality verification, and process-centric auditing across real-user tasks.
Recent progress in deep research systems has been impressive, but evaluation still lags behind real user needs. Existing benchmarks predominantly assess final reports using fixed rubrics, failing to evaluate the underlying research process. Most also offer limited multimodal coverage, rely on synthetic tasks that do not reflect real-world query complexity, and cannot be refreshed as knowledge evolves. To address these gaps, we introduce MiroEval, a benchmark and evaluation framework for deep research systems. The benchmark comprises 100 tasks (70 text-only, 30 multimodal), all grounded in real user needs and constructed via a dual-path pipeline that supports periodic updates, enabling a live and evolving setting. The proposed evaluation suite assesses deep research systems along three complementary dimensions: adaptive synthesis quality evaluation with task-specific rubrics, agentic factuality verification via active retrieval and reasoning over both web sources and multimodal attachments, and process-centric evaluation that audits how the system searches, reasons, and refines throughout its investigation. Evaluation across 13 systems yields three principal findings: the three evaluation dimensions capture complementary aspects of system capability, with each revealing distinct strengths and weaknesses across systems; process quality serves as a reliable predictor of overall outcome while revealing weaknesses invisible to output-level metrics; and multimodal tasks pose substantially greater challenges, with most systems declining by 3 to 10 points. The MiroThinker series achieves the most balanced performance, with MiroThinker-H1 ranking highest overall in both settings. Human verification and robustness results confirm the reliability of the benchmark and evaluation framework. MiroEval provides a holistic diagnostic tool for the next generation of deep research agents.
Community
We introduce MiroEval, a benchmark and evaluation framework for deep research systems with 100 tasks (70 text-only, 30 multimodal). Unlike existing benchmarks that only assess final reports, MiroEval evaluates systems along three dimensions: adaptive synthesis quality, agentic factuality verification, and process-centric evaluation.
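To make the rubric-driven synthesis scoring concrete, here is a minimal Python sketch. The criterion names, weights, and 0-5 scale are illustrative assumptions, not MiroEval's actual schema.

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    """One task-specific rubric criterion (hypothetical schema, not MiroEval's format)."""
    name: str
    weight: float  # relative importance; assumed to sum to 1.0 across the rubric
    score: float   # judge-assigned score on an assumed 0-5 scale

def synthesis_score(criteria: list[RubricCriterion]) -> float:
    """Weighted average of per-criterion scores, rescaled to 0-100."""
    total_weight = sum(c.weight for c in criteria)
    weighted = sum(c.weight * c.score for c in criteria) / total_weight
    return 100.0 * weighted / 5.0

# Example: a task-specific rubric for a hypothetical market-analysis query.
rubric = [
    RubricCriterion("covers all requested regions", 0.4, 4.0),
    RubricCriterion("cites primary sources",        0.3, 3.5),
    RubricCriterion("quantifies uncertainty",       0.3, 2.0),
]
print(f"synthesis quality: {synthesis_score(rubric):.1f}/100")
```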
We benchmark 13 leading systems including OpenAI Deep Research, Gemini, Claude, Grok, Manus, Kimi, and others. Key findings: process quality reliably predicts overall outcome (r=0.88); multimodal tasks cause 3-10 point drops; and synthesis quality vs. factuality rankings diverge substantially across systems.
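The r=0.88 figure is a Pearson correlation between per-system process scores and overall outcome scores. A quick way to run the same check on your own evaluation runs is sketched below; the numbers are made up for illustration, not the paper's data.

```python
import numpy as np

# Hypothetical per-system scores on a 0-100 scale; substitute real process/outcome scores.
process_scores = np.array([62, 71, 55, 80, 47, 68, 74, 59, 83, 50, 65, 77, 70])
outcome_scores = np.array([60, 74, 52, 78, 45, 70, 72, 61, 85, 48, 63, 79, 68])

# Pearson correlation coefficient between process quality and overall outcome.
r = np.corrcoef(process_scores, outcome_scores)[0, 1]
print(f"Pearson r = {r:.2f}")
```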
Blog: https://miroeval-ai.github.io/blog/
Project: https://miroeval-ai.github.io/website/
GitHub: https://github.com/MiroMindAI/MiroEval
the core of miroeval that grabs me is the attachment-aware, rubric-driven evaluation plus a process audit, which tries to assess how researchers actually work, not just what they output. that four-way labeling RIGHT, WRONG, CONFLICT, UNKNOWN for factual anchors is clever, but i'm curious how they calibrate across judges and how sensitive it is to tricky chart interpretations. an ablation where attachments are removed or where only textual anchors are used would reveal how much the multimodal verification actually contributes to the final score. there are worries about knowledge drift in the live setting and whether the framework could be gamed by querying or backfilling sources to look good. btw the arxivlens breakdown helped me parse the method details, a solid walkthrough on how the live, multi-layer eval hangs together, and this link helps keep it concrete: https://arxivlens.com/PaperView/Details/miroeval-benchmarking-multimodal-deep-research-agents-in-process-and-outcome-7188-42258562
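for anyone who wants to poke at the four-way labeling idea, here's a rough sketch of how a per-anchor verdict could be represented and rolled up into a factuality score. the verdict names come from the paper, but the schema and the scoring rule (UNKNOWN treated as neutral, CONFLICT and WRONG both counted against) are purely my assumptions, not the authors' code.

```python
from enum import Enum

class AnchorVerdict(Enum):
    """Four-way label for a verified factual anchor (names from the paper; comments are my reading)."""
    RIGHT = "right"        # claim matches retrieved evidence
    WRONG = "wrong"        # claim contradicted by evidence
    CONFLICT = "conflict"  # retrieved sources disagree with each other
    UNKNOWN = "unknown"    # no usable evidence found

def factuality_score(verdicts: list[AnchorVerdict]) -> float:
    """Fraction of anchors judged RIGHT among those with a definitive verdict (assumed rule)."""
    definitive = [v for v in verdicts if v is not AnchorVerdict.UNKNOWN]
    if not definitive:
        return 0.0
    return sum(v is AnchorVerdict.RIGHT for v in definitive) / len(definitive)

# Example: five anchors extracted from one report.
verdicts = [
    AnchorVerdict.RIGHT, AnchorVerdict.RIGHT, AnchorVerdict.CONFLICT,
    AnchorVerdict.UNKNOWN, AnchorVerdict.WRONG,
]
print(f"factuality: {factuality_score(verdicts):.2f}")
```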
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- DREAM: Deep Research Evaluation with Agentic Metrics (2026)
- MultiHaystack: Benchmarking Multimodal Retrieval and Reasoning over 40K Images, Videos, and Documents (2026)
- BrowseComp-V3: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents (2026)
- VisBrowse-Bench: Benchmarking Visual-Native Search for Multimodal Browsing Agents (2026)
- ContextBench: A Benchmark for Context Retrieval in Coding Agents (2026)
- LiveAgentBench: Comprehensive Benchmarking of Agentic Systems Across 104 Real-World Challenges (2026)
- HippoCamp: Benchmarking Contextual Agents on Personal Computers (2026)
been following the deep research agent space for a while, interesting to see a proper benchmark for multimodal research agents. this covers both the process and outcome which most evals skip. good summary here https://arxivexplained.com/papers/miroeval-benchmarking-multimodal-deep-research-agents-in-process-and-outcome