Benchmark-based evaluation is the de facto standard for comparing large language models (LLMs). However, its reliability is increasingly threatened by test set contamination, where test samples or their close variants leak into training data and artificially inflate reported performance. To address this issue, prior work has explored two main lines of mitigation. One line attempts to identify and remove contaminated benchmark items before evaluation, but this inevitably alters the evaluation set itself and becomes unreliable when contamination is moderate or severe. The other line preserves the benchmark and instead suppresses contaminated behavior at evaluation time; however, such interventions often interfere with normal inference and cause noticeable performance degradation on clean inputs. We propose DeconIEP, a decontamination framework that operates entirely at evaluation time by applying small, bounded perturbations in the input embedding space. Guided by a less-contaminated reference model, DeconIEP learns an instance-adaptive perturbation generator that steers the evaluated model away from memorization-driven shortcut pathways. Across multiple open-weight LLMs and benchmarks, extensive empirical results show that DeconIEP achieves strong decontamination effectiveness while incurring only minimal degradation in benign utility.
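The core mechanism described above, a small, bounded perturbation applied to input embeddings by an instance-adaptive generator, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the embedding dimension, perturbation budget `epsilon`, and the linear stand-in for the learned generator are all assumptions introduced for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes and budget; not taken from the paper.
d_model = 16      # embedding dimension
seq_len = 8       # tokens in the evaluated prompt
epsilon = 0.05    # per-dimension perturbation budget (assumed hyperparameter)

# Stand-in for the instance-adaptive perturbation generator: a single
# linear map. In the framework this would be a learned network trained
# against a less-contaminated reference model.
W = rng.normal(scale=0.1, size=(d_model, d_model))

def perturb(embeddings: np.ndarray) -> np.ndarray:
    """Apply a small, bounded perturbation in embedding space.

    tanh keeps each raw generator output in (-1, 1); scaling by
    epsilon bounds the magnitude of every coordinate of the delta,
    so the perturbed embeddings stay close to the originals.
    """
    delta = epsilon * np.tanh(embeddings @ W)
    return embeddings + delta

x = rng.normal(size=(seq_len, d_model))   # stand-in input embeddings
x_pert = perturb(x)

# The perturbation never exceeds the budget in any coordinate,
# which is what keeps the intervention from disrupting clean inputs.
assert np.all(np.abs(x_pert - x) <= epsilon)
```

Because the perturbation acts only on the embeddings fed to the evaluated model, the benchmark items themselves are left unchanged, which is the property that distinguishes this line of mitigation from item-removal approaches.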