Deep research systems are widely used for multi-step web research, analysis, and cross-source synthesis, yet their evaluation remains challenging. Existing benchmarks often require annotation-intensive task construction, rely on static evaluation dimensions, or fail to verify facts reliably when citations are missing. To bridge these gaps, we introduce DeepResearchEval, an automated framework for deep research task construction and agentic evaluation. For task construction, we propose a persona-driven pipeline that generates realistic, complex research tasks anchored in diverse user profiles, applying a two-stage filter (Task Qualification and Search Necessity) to retain only tasks that require multi-source evidence integration and external retrieval. For evaluation, we propose an agentic pipeline with two components: Adaptive Point-wise Quality Evaluation, which dynamically derives task-specific evaluation dimensions, criteria, and weights conditioned on each generated task, and Active Fact-Checking, which autonomously extracts and verifies report statements via web search, even when citations are missing.
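The adaptive point-wise evaluation described above can be sketched as a weighted aggregate over task-specific dimensions. The sketch below is illustrative only: the dimension names, weights, and 0-10 score scale are assumptions standing in for what an LLM judge would derive per task; the abstract does not specify the aggregation formula.

```python
# Hypothetical sketch of adaptive point-wise scoring. In the actual framework,
# an evaluator would derive task-specific dimensions, criteria, and weights;
# here they are hard-coded stand-ins to illustrate the weighted aggregate.
from dataclasses import dataclass

@dataclass
class Dimension:
    name: str
    weight: float   # task-specific weight (assumed non-negative)
    score: float    # judge-assigned score on an assumed 0-10 scale

def aggregate_quality(dims: list[Dimension]) -> float:
    """Weight-normalized point-wise quality score for one generated report."""
    total_weight = sum(d.weight for d in dims)
    return sum(d.weight * d.score for d in dims) / total_weight

# Example dimensions (hypothetical, not from the paper):
dims = [
    Dimension("evidence integration", 0.40, 8.0),
    Dimension("factual grounding", 0.35, 6.0),
    Dimension("clarity", 0.25, 9.0),
]
print(round(aggregate_quality(dims), 2))  # 7.55
```

Normalizing by the total weight keeps the aggregate on the same scale as the per-dimension scores even when the derived weights do not sum to one.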