Deep Research Agents (DRAs) have demonstrated remarkable capabilities in autonomous information retrieval and report generation, showing great potential to assist humans in complex research tasks. Current evaluation frameworks rely primarily on LLM-generated references or LLM-derived evaluation dimensions. While these approaches offer scalability, they often lack the reliability of expert-verified content and struggle to provide objective, fine-grained assessments along critical dimensions. To bridge this gap, we introduce Wiki Live Challenge (WLC), a live benchmark that uses the newest Wikipedia Good Articles (GAs) as expert-level references. Wikipedia's strict standards for neutrality, comprehensiveness, and verifiability pose a substantial challenge for DRAs, and GAs represent the pinnacle of these standards. We curate a dataset of 100 recent Good Articles and propose Wiki Eval, a comprehensive evaluation framework comprising a fine-grained evaluation method with 39 criteria for writing quality and rigorous metrics for factual verifiability. Extensive experiments on various DRA systems reveal a significant gap between current DRAs and human expert-level Wikipedia articles, validating the effectiveness of WLC in advancing agent research. We release our benchmark at https://github.com/WangShao2000/Wiki_Live_Challenge