Existing benchmarks for Deep Research Agents (DRAs) treat report generation as a single-shot writing task, which fundamentally diverges from how human researchers iteratively draft and revise reports via self-reflection or peer feedback. Whether DRAs can reliably revise reports in response to user feedback remains unexplored. We introduce Mr Dre, an evaluation suite that establishes multi-turn report revision as a new evaluation axis for DRAs. Mr Dre consists of (1) a unified long-form report evaluation protocol spanning comprehensiveness, factuality, and presentation, and (2) a human-verified feedback simulation pipeline for multi-turn revision. Our analysis of five diverse DRAs reveals a critical limitation: while agents can address most user feedback, they also regress on 16–27% of previously covered content and citation quality. Over multiple revision turns, even the best-performing agents leave significant headroom: they continue to disrupt content outside the feedback's scope and fail to preserve earlier edits. We further show that these issues are not easily resolved by inference-time fixes such as prompt engineering or a dedicated sub-agent for report revision.