Existing benchmarks for deep research agents (DRAs) assess only single-shot outputs, ignoring a key question: can DRAs improve their reports when guided by feedback? To investigate this, we conduct a multi-turn evaluation of DRAs under two feedback settings: self-reflection, in which the agent revises its report without any external diagnostic signal, and process-level feedback, in which the agent receives guidance targeting gaps in its research strategy. To enable process-level feedback, we design Research Gap Inference (RGI), a method that analyzes patterns of satisfied and unsatisfied rubric criteria to infer research-process gaps. Our analysis reveals three key findings: (i) under self-reflection, agents incorporate and regress on rubric criteria at nearly equal rates, so the net improvement is negligible; (ii) a single round of process-level feedback produces substantial gains, raising the normalized score by approximately $8$-$15$ points with an incorporation rate of roughly $35$-$40\%$; (iii) these gains do not compound over subsequent turns, as agents regress on up to $24\%$ of previously satisfied criteria when rewriting the full report to address remaining gaps. Even with targeted guidance, reliable multi-turn improvement remains out of reach for the DRA architectures we evaluate. Our code and results are publicly available at https://github.com/sabharwalrishabh/Multi-Turn-Evaluation-of-DRAs.
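To make the incorporation and regression rates concrete, the following is a minimal sketch of how they could be computed from per-turn rubric results. The definitions are assumptions inferred from the wording above, not the released implementation: the incorporation rate is taken as the fraction of previously unsatisfied criteria that become satisfied, and the regression rate as the fraction of previously satisfied criteria that become unsatisfied. The function name `turn_transition_rates` and the criterion IDs are hypothetical.

```python
from typing import Dict, Tuple


def turn_transition_rates(
    prev: Dict[str, bool], curr: Dict[str, bool]
) -> Tuple[float, float]:
    """Return (incorporation_rate, regression_rate) between two turns.

    Each dict maps a rubric criterion ID to whether the report satisfied it.
    Assumed definitions (not confirmed by the source):
      incorporation = newly satisfied / previously unsatisfied
      regression    = newly unsatisfied / previously satisfied
    """
    # Compare only criteria that were judged in both turns.
    shared = prev.keys() & curr.keys()
    prev_unsat = [c for c in shared if not prev[c]]
    prev_sat = [c for c in shared if prev[c]]

    incorporated = sum(1 for c in prev_unsat if curr[c])
    regressed = sum(1 for c in prev_sat if not curr[c])

    # Guard against empty denominators.
    inc_rate = incorporated / len(prev_unsat) if prev_unsat else 0.0
    reg_rate = regressed / len(prev_sat) if prev_sat else 0.0
    return inc_rate, reg_rate


# Illustrative (made-up) rubric results for one report over two turns.
turn1 = {"c1": True, "c2": True, "c3": False, "c4": False, "c5": True, "c6": True}
turn2 = {"c1": True, "c2": False, "c3": True, "c4": False, "c5": True, "c6": True}

inc, reg = turn_transition_rates(turn1, turn2)
print(f"incorporation={inc:.2f}, regression={reg:.2f}")
```

In this toy example one criterion is gained and one is lost, so the number of satisfied criteria stays at four even though the report changed. This is the kind of churn that finding (i) describes for self-reflection.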