Deep Research Agents (DRAs) embody the evolution of intelligent systems toward interconnected architectures, systematically exhibiting capabilities in task decomposition, cross-source retrieval, multi-stage reasoning, information integration, and structured output that markedly enhance performance on complex, open-ended tasks. However, existing benchmarks remain deficient in their evaluation dimensions, response formats, and scoring mechanisms, limiting their effectiveness in assessing such agents. This paper introduces Dr. Bench, a multidimensional evaluation framework tailored to DRAs and long-form, report-style responses. The benchmark comprises 214 expert-curated challenging tasks across 10 broad domains, each accompanied by a manually constructed reference bundle to support composite evaluation. The framework incorporates metrics for semantic quality, topical focus, and retrieval trustworthiness, enabling comprehensive evaluation of the long-form reports DRAs generate. Extensive experiments confirm that mainstream DRAs outperform reasoning models augmented with web-search tools, while also revealing considerable room for improvement. This study provides a robust foundation for capability assessment, architectural refinement, and paradigm advancement of DRAs.