Data contamination poses a significant threat to the reliable evaluation of Large Language Models (LLMs). The issue arises when benchmark samples inadvertently appear in training sets, compromising the validity of reported performance. While detection methods have been developed for the pre-training and Supervised Fine-Tuning stages, a critical research gap remains for the increasingly important phase of Reinforcement Learning (RL) post-training. As RL post-training becomes pivotal for advancing LLM reasoning, the absence of specialized contamination detection methods in this paradigm leaves a serious vulnerability. To address this, we conduct the first systematic study of contamination detection in the RL post-training scenario and propose Self-Critique. Our method is motivated by a key observation: after the RL phase, the output entropy distribution of LLMs tends to collapse into highly specific and sparse modes. Self-Critique probes for the underlying policy collapse, i.e., the model's convergence to a narrow reasoning path, which causes this entropy reduction. To facilitate this research, we also introduce RL-MIA, a benchmark constructed to simulate this specific contamination scenario. Extensive experiments show that Self-Critique significantly outperforms baseline methods across multiple models and contamination tasks, achieving an AUC improvement of up to 30%. Whereas existing methods perform close to random guessing on RL-phase contamination, our method makes detection feasible.
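To make the entropy-collapse observation concrete, below is a minimal sketch (not the paper's actual Self-Critique implementation) of how one might measure the mean next-token entropy of a model over a benchmark sample; a contaminated model is expected to assign sharply peaked, low-entropy distributions when regenerating a sample it was RL-trained on. The model name and the comparison against a clean calibration set are illustrative assumptions.

```python
# Sketch: per-token output entropy as a contamination signal.
# Assumes Hugging Face transformers; model name is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def mean_token_entropy(text: str) -> float:
    """Average Shannon entropy (nats) of the model's next-token
    distributions across all positions of the given text."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, :-1]  # distribution predicting tokens 1..n
    log_probs = torch.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # entropy per position
    return entropy.mean().item()

# Markedly lower entropy than on known-clean calibration samples would be
# consistent with policy collapse onto this sample (RL-phase contamination).
sample = "Q: What is 17 * 24? A: Let's compute step by step..."
print(f"mean entropy: {mean_token_entropy(sample):.3f} nats")
```

In practice such a raw entropy score would be calibrated against a reference set of samples known to be outside the RL training data; the paper's Self-Critique method goes further by directly probing the policy collapse behind this entropy reduction.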