Evaluating large language models (LLMs) is increasingly confounded by \emph{variant contamination}: the training corpus contains semantically equivalent yet lexically or syntactically altered versions of test items. Unlike verbatim leakage, these paraphrased or structurally transformed variants evade existing detectors based on sampling consistency or perplexity, thereby inflating benchmark scores via memorization rather than genuine reasoning. We formalize this problem and introduce \textbf{DVD} (\textbf{D}etection via \textbf{V}ariance of generation \textbf{D}istribution), a single-sample detector that models the local output distribution induced by temperature sampling. Our key insight is that contaminated items trigger alternation between a \emph{memory-adherence} state and a \emph{perturbation-drift} state, yielding abnormally high variance in the synthetic difficulty of low-probability tokens, whereas uncontaminated items remain in the drift state with comparatively smooth variance. We construct the first benchmark for variant contamination across two domains, Omni-MATH and SuperGPQA, by generating and filtering semantically equivalent variants, and we simulate contamination by fine-tuning models of different scales and architectures (Qwen2.5 and Llama3.1). Across datasets and models, \textbf{DVD} consistently outperforms perplexity-based, Min-$k$\%++, edit-distance (CDD), and embedding-similarity baselines, while exhibiting strong robustness to hyperparameter choices. Our results establish the variance of the generation distribution as a principled and practical fingerprint for detecting variant contamination in LLM evaluation.
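For intuition only, the following is a minimal sketch of the kind of statistic the abstract describes: score a single temperature-sampled generation by the variance of a per-token difficulty over its low-probability tokens. The abstract does not define ``synthetic difficulty''; the sketch below assumes negative log-probability as a proxy, and the function name \texttt{dvd\_score} and the bottom-$k$ fraction are illustrative assumptions, not the paper's exact method.

\begin{verbatim}
import numpy as np

def dvd_score(token_logprobs, k=0.2):
    # token_logprobs: per-token log-probabilities from one
    # temperature-sampled generation (single-sample setting).
    # ASSUMPTION: "synthetic difficulty" is proxied here by
    # negative log-probability; the paper's definition may differ.
    difficulty = -np.asarray(token_logprobs, dtype=float)
    # Keep only the lowest-probability (hardest) tokens.
    n = max(1, int(len(difficulty) * k))
    hardest = np.sort(difficulty)[-n:]
    # High variance suggests alternation between the
    # memory-adherence and perturbation-drift states,
    # i.e. a likely contaminated item.
    return float(np.var(hardest))
\end{verbatim}

Under this reading, an item would be flagged as variant-contaminated when its score exceeds a threshold calibrated on known-clean items.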