As large language models advance, deep research systems capable of generating expert-level reports through multi-step reasoning and evidence-based synthesis are emerging. However, evaluating such reports remains challenging. Existing benchmarks often lack systematic evaluation criteria, rely heavily on LLM-based judges that may miss issues requiring expert judgment, and verify only a limited subset of explicitly cited statements rather than report-wide factual reliability. To address these limitations, we introduce DEER, a benchmark for evaluating expert-level deep research reports. DEER comprises 50 report-writing tasks spanning 13 domains, paired with an expert-grounded evaluation taxonomy of seven dimensions and 25 subdimensions, operationalized into 101 fine-grained rubric items. To improve evaluation consistency, DEER provides task-specific Expert Evaluation Guidance to support LLM-based judging. Complementing rubric-based assessment, we propose a document-level fact-checking architecture that verifies both cited and uncited claims and quantifies the quality and reliability of the supporting evidence. Experimental results show that DEER correlates strongly with human expert judgments and yields interpretable diagnostics of system strengths and weaknesses.