Search agents, large language models augmented with search tools, have intensified the need for future-proof evaluation benchmarks. Existing benchmarks such as BrowseComp rely on static knowledge, which leaves them vulnerable to test-set contamination and parametric memorization. Consequently, models can achieve high scores through fact recall and reasoning shortcuts rather than genuine retrieval, which obscures their true browsing competence. In this paper, we introduce EvoBrowseComp, an evolving benchmark of 400 English and 400 Chinese contamination-free complex questions synthesized via live-web traversal. To collect these questions, we design a three-agent collaborative framework: (1) a QA synthesis agent that retrieves fresh knowledge from the live web to synthesize QA pairs; (2) an information filtering agent that filters retrieved knowledge by credibility and popularity to block parametric shortcuts; and (3) a high-level guidance agent that formalizes questions into reasoning graphs to reduce logical redundancy and shortcuts in the synthesized QA pairs. Because the framework supports fully automated synthesis, EvoBrowseComp can be regularly updated to prevent data contamination and maintain temporal freshness. Extensive experiments confirm that the benchmark is highly difficult and that solving it requires broad horizontal search. EvoBrowseComp thus establishes a scalable paradigm for auto-updatable, high-difficulty benchmarking that keeps pace with both evolving world knowledge and advancing agent capabilities.
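To make the roles of the filtering and guidance agents more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that retrieved knowledge can be represented as subject-relation-object facts with numeric credibility and popularity scores, that filtering keeps facts that are credible but obscure, and that a "shortcut" means the answer is reachable in fewer hops than the question intends. The `Fact` schema, the thresholds, and all function names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """One retrieved statement: subject --relation--> obj (hypothetical schema)."""
    subject: str
    relation: str
    obj: str
    credibility: float  # assumed score in [0, 1]; higher = more trustworthy source
    popularity: float   # assumed score in [0, 1]; higher = more likely memorized


def filter_facts(facts, min_credibility=0.7, max_popularity=0.3):
    """Keep facts that are credible yet obscure (thresholds are illustrative)."""
    return [f for f in facts
            if f.credibility >= min_credibility and f.popularity <= max_popularity]


def shortest_hops(facts, start, goal):
    """Breadth-first search over the directed fact graph; None if unreachable."""
    graph = {}
    for f in facts:
        graph.setdefault(f.subject, []).append(f.obj)
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def has_shortcut(facts, start, answer, intended_hops):
    """True if the answer is reachable in fewer hops than the question intends."""
    hops = shortest_hops(facts, start, answer)
    return hops is not None and hops < intended_hops


# Toy example: an intended three-hop chain A -> B -> C -> D plus two distractors.
facts = [
    Fact("A", "r1", "B", credibility=0.9, popularity=0.1),
    Fact("B", "r2", "C", credibility=0.8, popularity=0.2),
    Fact("C", "r3", "D", credibility=0.9, popularity=0.1),
    Fact("A", "r4", "D", credibility=0.9, popularity=0.9),  # direct, widely known
    Fact("B", "r5", "D", credibility=0.2, popularity=0.1),  # unreliable source
]
kept = filter_facts(facts)
shortcut_before = has_shortcut(facts, "A", "D", intended_hops=3)
shortcut_after = has_shortcut(kept, "A", "D", intended_hops=3)
```

In this toy graph, the widely known direct link from `A` to `D` lets a solver skip the intended chain. Once the popular and low-credibility facts are filtered out, only the three-hop path remains. A real pipeline would need learned or heuristic scoring for credibility and popularity, which this sketch simply takes as given.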