Public benchmarks enable fair and reproducible evaluation of LLM reasoning, but they become fragile for deep research agents that actively search the web during inference. Such agents may retrieve public benchmark metadata, question context, or even ground-truth answers via web search. This gives rise to Search-Time Contamination (STC), where external retrieval bypasses intended reasoning and inflates measured performance. We systematically study STC in deep research agent evaluation. We define three contamination types with increasing severity, namely Benchmark Metadata Leakage, Question-Context Leakage, and Explicit Answer Leakage, and develop detection algorithms to identify them and quantify their impact on agent performance. Evaluating modern deep research agents on six public benchmarks, we find that STC is widespread and can inflate performance by up to 4%. Our findings show that existing evaluations may overestimate true reasoning ability. We therefore advocate contamination-aware practices, including isolated sandboxes, transparent search trajectories, and controlled benchmark access.
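To make the three contamination types concrete, the following is a minimal sketch of how a retrieved page might be assigned a severity level. It is not the paper's detection algorithm, which the abstract does not specify. The `Item` fields, the word n-gram overlap heuristic, the `threshold` value, and the function names `classify_page` and `classify_trajectory` are all illustrative assumptions.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Iterable


class Leak(IntEnum):
    """Severity levels, ordered as in the abstract (higher is worse)."""
    NONE = 0
    BENCHMARK_METADATA = 1  # page names the benchmark itself
    QUESTION_CONTEXT = 2    # page reproduces the question text
    EXPLICIT_ANSWER = 3     # page reproduces the question and its gold answer


@dataclass
class Item:
    """Hypothetical benchmark record; field names are assumptions."""
    benchmark: str
    question: str
    answer: str


def _norm(text: str) -> str:
    # Lowercase and collapse whitespace so matching ignores layout.
    return " ".join(text.lower().split())


def _ngram_overlap(needle: str, haystack: str, n: int = 5) -> float:
    """Fraction of the needle's word n-grams that occur in the haystack."""
    words = _norm(needle).split()
    hay = _norm(haystack)
    if not words:
        return 0.0
    if len(words) < n:
        # Too short for n-grams: fall back to a whole-string match.
        return 1.0 if " ".join(words) in hay else 0.0
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return sum(g in hay for g in grams) / len(grams)


def classify_page(item: Item, page_text: str, threshold: float = 0.5) -> Leak:
    """Assign one retrieved page the most severe level it matches."""
    hay = _norm(page_text)
    answer = _norm(item.answer)
    name = _norm(item.benchmark)
    has_question = _ngram_overlap(item.question, page_text) >= threshold
    if has_question and answer and answer in hay:
        return Leak.EXPLICIT_ANSWER
    if has_question:
        return Leak.QUESTION_CONTEXT
    if name and name in hay:
        return Leak.BENCHMARK_METADATA
    return Leak.NONE


def classify_trajectory(item: Item, pages: Iterable[str]) -> Leak:
    """A search trajectory is as contaminated as its worst page."""
    return max((classify_page(item, p) for p in pages), default=Leak.NONE)
```

Under these assumptions, the impact on measured performance could be estimated by comparing accuracy on items whose trajectories are labeled `Leak.NONE` against accuracy on the full set. Plain substring matching is brittle against paraphrase, so a practical detector would likely need fuzzier matching than this sketch uses.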