The agency expected of agentic large language models goes beyond answering questions correctly: it requires the autonomy to set goals and decide what to explore. We term this capability investigatory intelligence, distinguishing it from executional intelligence, which merely completes assigned tasks. Data science provides a natural testbed, since real-world analysis starts from raw data rather than explicit queries, yet few benchmarks target this setting. To address this gap, we introduce Deep Data Research (DDR), an open-ended task in which LLMs autonomously extract key insights from databases, and DDR-Bench, a large-scale, checklist-based benchmark that enables verifiable evaluation. Results show that while frontier models display emerging agency, long-horizon exploration remains challenging. Our analysis highlights that effective investigatory intelligence depends not only on agent scaffolding or sheer scale, but also on the intrinsic strategies of agentic models.