Hunt Instead of Wait: Evaluating Deep Data Research on Large Language Models
Abstract
Agentic large language models require investigatory intelligence for autonomous data analysis, as demonstrated by the Deep Data Research benchmark, which evaluates their ability to extract insights from databases without explicit queries.
The agency expected of agentic large language models goes beyond answering correctly; it requires the autonomy to set goals and decide what to explore. We term this investigatory intelligence, distinguishing it from executional intelligence, which merely completes assigned tasks. Data science provides a natural testbed, as real-world analysis starts from raw data rather than explicit queries, yet few benchmarks focus on it. To address this, we introduce Deep Data Research (DDR), an open-ended task in which LLMs autonomously extract key insights from databases, and DDR-Bench, a large-scale, checklist-based benchmark that enables verifiable evaluation. Results show that while frontier models display emerging agency, long-horizon exploration remains challenging. Our analysis highlights that effective investigatory intelligence depends not only on agent scaffolding or sheer scale, but also on the intrinsic strategies of agentic models.
Community
This paper introduces Deep Data Research, shifting from executional intelligence, which focuses on completing assigned tasks, to investigatory intelligence, where agents autonomously set goals and explore. Under this paradigm, agentic LLMs are allowed to explore databases freely, discovering the insights hidden in the data without any predefined queries, questions, or objectives.
The LLM-generated insights are evaluated against a fact checklist derived from the freeform components of the database, which naturally yields an aligned and objective evaluation.
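To make this concrete, below is a minimal sketch of checklist-based scoring: a free-form insight report is graded by how many checklist facts it supports, yielding a recall-style coverage score. The `ChecklistItem` structure and the keyword-overlap verifier are illustrative assumptions, not DDR-Bench's actual grading protocol (which might, for instance, use an LLM judge per fact).

```python
from dataclasses import dataclass

@dataclass
class ChecklistItem:
    fact: str           # a verifiable fact derived from the database
    keywords: set[str]  # illustrative proxy: tokens a supporting insight should mention

def is_supported(report: str, item: ChecklistItem) -> bool:
    """Placeholder verifier: counts a fact as covered when the report
    mentions all of its key tokens; a real grader would likely use an
    LLM judge per checklist fact."""
    text = report.lower()
    return all(kw in text for kw in item.keywords)

def checklist_recall(report: str, checklist: list[ChecklistItem]) -> float:
    """Fraction of checklist facts the model's insight report covers."""
    hits = sum(is_supported(report, item) for item in checklist)
    return hits / len(checklist)

# Hypothetical usage: grade one model-generated report against three facts.
checklist = [
    ChecklistItem("Revenue is concentrated in Q4.", {"revenue", "q4"}),
    ChecklistItem("Churn correlates with support tickets.", {"churn", "ticket"}),
    ChecklistItem("Returns spike after promotions.", {"returns", "promotion"}),
]
report = "Exploration shows Q4 drives most revenue; churn rises with ticket volume."
print(f"checklist recall = {checklist_recall(report, checklist):.2f}")  # 0.67
```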
Beyond presenting DDR-Bench, the paper's experimental analysis reveals some insights into investigatory intelligence:
Inference-time Scaling Dynamics: Top models exhibit "quality over quantity". They delay commitment and concentrate reasoning into a few high-value late-stage interactions. Token scaling shows flat-then-sharp patterns where final-stage tokens deliver disproportionate value, signalling depth-first exploration after breadth-oriented search.
Balanced Exploration Regime: Entropy-based visualisation reveals that advanced models consistently operate in a balanced regime combining coverage with focus, supporting an "implicit planning hypothesis" in which strong models maintain coherent exploration strategies without explicit scaffolding (a minimal sketch of such an entropy measure follows this list).
Training Trumps Scaling: Analysis of the Qwen family shows that parameter scaling alone yields marginal gains (under 3% from 10× more parameters) and that longer context windows do not consistently help. However, newer-generation models with agentic-first training, including targeted pre-training and reinforcement learning, achieve substantially higher ceilings despite fewer activated parameters, demonstrating that meaningful agency requires intentional training strategies rather than mere scale.
Scaffolding Paradox: Adding sophisticated frameworks such as reasoning or memory mechanisms leads to unpredictable behaviours and may degrade performance. The proactive-versus-reactive comparison shows substantial gaps, confirming that the demands of autonomous goal-setting far exceed those of executing predefined objectives.
Failure Modes: 58% of errors stem from insufficient exploration breadth or depth, while another 40% involve other issues: powerful models over-reason and introduce unsupported assumptions, whereas weaker models struggle with instruction following.
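As a rough illustration of the entropy-based view referenced above, exploration balance can be quantified as the normalised Shannon entropy of how an agent's queries spread across database tables. The function below is an assumption for illustration, not the paper's exact measure: 0 means pure focus on one table, 1 means uniform coverage, and a "balanced regime" sits in between.

```python
import math
from collections import Counter

def exploration_entropy(visited_tables: list[str]) -> float:
    """Normalised Shannon entropy of an agent's table-access distribution.
    0.0 = every query hits one table (pure focus);
    1.0 = queries spread uniformly over all visited tables (pure coverage)."""
    counts = Counter(visited_tables)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    h = -sum(p * math.log2(p) for p in probs)
    max_h = math.log2(len(counts)) if len(counts) > 1 else 1.0
    return h / max_h

# Hypothetical traces: which table each of an agent's queries touched.
focused  = ["orders"] * 9 + ["users"]
balanced = ["orders"] * 4 + ["users"] * 3 + ["events"] * 3
print(exploration_entropy(focused))   # ~0.47: skewed toward one table
print(exploration_entropy(balanced))  # ~0.99: near-uniform coverage
```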