Leaderboards for large reasoning models (LRMs) have turned evaluation into a competition, incentivizing developers to optimize directly on benchmark suites. A shortcut to higher rankings is to incorporate evaluation benchmarks into the training data, yielding inflated performance, a practice known as benchmark contamination. Surprisingly, our studies find that evading contamination detection for LRMs is alarmingly easy. We focus on two scenarios where contamination may occur in practice: (I) when a base model evolves into an LRM via supervised fine-tuning (SFT) and reinforcement learning (RL), we find that contamination introduced during SFT is initially identifiable by contamination detection methods. Yet even brief GRPO training can markedly conceal the contamination signals that most detection methods rely on. Further empirical experiments and theoretical analysis indicate that PPO-style importance sampling and clipping objectives are the root cause of this concealment, suggesting that a broad class of RL methods may inherently exhibit similar concealment capability; (II) when SFT contamination with chain-of-thought (CoT) data is applied to an advanced LRM as the final training stage, most contamination detection methods perform close to random guessing. Despite never being exposed to non-member samples, contaminated LRMs still respond with high confidence to unseen samples drawn from distributions similar to the training set, and thus evade existing memorization-based detection methods. Together, our findings reveal a unique vulnerability of LRM evaluation: model developers can easily contaminate LRMs to achieve inflated leaderboard performance while leaving minimal traces of contamination, severely undermining the fairness of evaluation and threatening the integrity of public leaderboards. This underscores the urgent need for advanced contamination detection methods and trustworthy evaluation protocols tailored to LRMs.
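For reference, and as a standard formulation rather than this paper's exact training objective, the PPO-style surrogate that the abstract identifies as the root cause of concealment combines an importance-sampling ratio $r_t(\theta)$ with a clipping threshold $\epsilon$:

$$
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}, \qquad
\mathcal{L}^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\right)\right].
$$

GRPO inherits this ratio-and-clip structure while replacing the learned value baseline with a group-normalized advantage estimate, which is why the abstract expects the concealment effect to extend to a broad class of PPO-style RL methods.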
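To make "memorization-based detection" concrete, below is a minimal sketch of one representative method, Min-K% Prob (Shi et al., 2024), which scores a sample by the average log-probability of its least-likely tokens; members (training samples) tend to score higher than non-members. The model name, example text, and choice of k are illustrative placeholders, not this paper's experimental setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def min_k_prob(text, model, tokenizer, k=0.2):
    """Min-K% Prob membership score: mean log-prob of the k% least-likely tokens.
    Higher scores suggest the text was seen during training."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Log-probability the model assigns to each ground-truth next token.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_lp = log_probs.gather(-1, ids[0, 1:].unsqueeze(-1)).squeeze(-1)
    n = max(1, int(k * token_lp.numel()))
    # Average over the n lowest-probability tokens only.
    return token_lp.topk(n, largest=False).values.mean().item()

# Illustrative usage; "gpt2" is a placeholder, not one of the models studied.
tok = AutoTokenizer.from_pretrained("gpt2")
mdl = AutoModelForCausalLM.from_pretrained("gpt2").eval()
score = min_k_prob("If x + 3 = 7, then x = 4.", mdl, tok)
print(f"Min-K% score: {score:.3f}")  # compare against a threshold calibrated on non-members
```

The abstract's second finding is precisely that contaminated LRMs also assign high confidence to non-member samples from similar distributions, collapsing the member/non-member score gap that detectors of this kind depend on.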