Deep Research Agents (DRAs) aim to automatically produce analyst-level reports through iterative information retrieval and synthesis. However, most existing DRAs have been validated only on question-answering benchmarks, while research on generating comprehensive reports remains largely overlooked. Worse, current benchmarks for report synthesis suffer from excessive task complexity and highly subjective evaluation metrics; this fails to reflect real user demands and limits the practical utility of generated reports. To address these gaps, we present the Fine-grained DEepResearch bench (FINDER), an enhanced benchmark consisting of 100 human-curated research tasks with 419 structured checklist items that standardize report structure, analytical depth, and factual grounding. Based on approximately 1,000 reports produced by mainstream DRAs, we further propose the Deep rEsearch Failure Taxonomy (DEFT), the first failure taxonomy for deep research agents. DEFT comprises 14 fine-grained failure modes spanning reasoning, retrieval, and generation; it is built on grounded theory with human-LLM co-annotation and validated for inter-annotator reliability. Our experimental findings reveal that current DRAs struggle not with task comprehension but with evidence integration, verification, and reasoning-resilient planning.
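The abstract reports inter-annotator reliability validation for DEFT but does not name the statistic used. As a purely illustrative sketch, not FINDER's actual protocol, the snippet below computes Cohen's kappa between a human annotator and an LLM annotator, assuming each assigns a single failure-mode label per report; all identifiers and data are hypothetical.

```python
from collections import Counter

def cohen_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Chance-corrected agreement between two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b) and labels_a, "need paired labels"
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators assign the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled independently per their own marginals.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    if p_e == 1.0:  # degenerate case: both annotators use a single identical label
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: human vs. LLM annotations of failure categories per report.
human = ["retrieval", "reasoning", "generation", "retrieval", "reasoning"]
llm   = ["retrieval", "reasoning", "generation", "reasoning", "reasoning"]
print(f"kappa = {cohen_kappa(human, llm):.2f}")  # kappa = 0.69 on this toy data
```

A kappa near 1 indicates agreement well beyond chance; annotation studies of this kind typically report such a chance-corrected statistic rather than raw percent agreement, since marginal label frequencies can inflate the latter.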