Diagnosing the failure mechanisms of Deep Research Agents (DRAs) remains a critical challenge. Existing benchmarks predominantly rely on end-to-end evaluation, obscuring critical intermediate hallucinations, such as flawed planning, that accumulate throughout the research trajectory. To bridge this gap, we propose a shift from outcome-based to process-aware evaluation by auditing the full research trajectory. We introduce the PIES Taxonomy, which categorizes hallucinations along functional components (Planning vs. Summarization) and error properties (Explicit vs. Implicit). We instantiate this taxonomy in a fine-grained evaluation framework that decomposes the trajectory to rigorously quantify these hallucinations. Leveraging this framework to isolate 100 distinctively hallucination-prone tasks, including adversarial scenarios, we curate DeepHalluBench. Experiments on six state-of-the-art DRAs reveal that no system achieves robust reliability. Furthermore, our diagnostic analysis traces the etiology of these failures to systemic deficits, specifically hallucination propagation and cognitive biases, providing foundational insights to guide future architectural optimization. Data and code are available at https://github.com/yuhao-zhan/DeepHalluBench.