A surge in academic publications calls for automated deep research (DR) systems, but accurately evaluating them remains an open problem. First, existing benchmarks often focus narrowly on retrieval while neglecting high-level planning and reasoning. Second, they favor general domains over the academic domains that are the core application for DR agents. To address these gaps, we introduce ADRA-Bank, a modular benchmark for Academic DR Agents. Grounded in academic literature, our benchmark is a human-annotated dataset of 200 instances spanning 10 academic domains and covering both research and review papers. Furthermore, we propose a modular Evaluation Paradigm for Academic DR Agents (ADRA-Eval), which leverages the rich structure of academic papers to assess the core capabilities of planning, retrieval, and reasoning. It employs two complementary modes: an end-to-end evaluation for DR agents and an isolated evaluation for foundational LLMs as potential backbones. Results reveal uneven capabilities: while agents show specialized strengths, they struggle with multi-source retrieval and cross-field consistency. Moreover, improving high-level planning is the crucial factor for unlocking the reasoning potential of foundational LLMs as backbones. By exposing these actionable failure modes, ADRA-Bank provides a diagnostic tool to guide the development of more reliable automated academic research assistants.