Deep research agents are LLM-based systems that perform multi-step information seeking and reasoning over large, open-domain sources, answering complex questions by synthesizing information from multiple sources. Despite various recent efforts, evaluating these agents remains fundamentally challenging because of the complexity of the task. This paper identifies a set of requirements and optional properties for evaluating deep research agents. We observe that existing benchmarks do not satisfy all of the identified requirements. Inspired by prior research in the TREC Total Recall Tracks, we introduce the task of Total Recall Question Answering and develop a framework for evaluating deep research agents that satisfies the identified criteria. Our framework constructs single-answer, total recall queries with precise evaluation and relevance judgments derived from a structured knowledge base paired with a text corpus, enabling large-scale data construction. Using this framework, we build TRQA, a deep research benchmark constructed from two sources: Wikidata-Wikipedia as a real-world source, and a synthetically generated e-commerce knowledge base and corpus intended to mitigate the effects of data contamination. We benchmark the collection with representative retrievers and deep research models, establishing baseline retrieval and end-to-end results for future comparative evaluation.