The advancement of large language models (LLMs) relies on evaluation using public benchmarks, but data contamination can lead to overestimated performance. Previous research focuses on detecting contamination by determining whether a model has seen exactly the same data during training. In this work, we argue that even training on data similar to benchmark data inflates performance on in-distribution tasks without improving overall capability, a problem we call in-distribution contamination. To detect it effectively, we propose DICE, a novel method that leverages the internal states of LLMs to locate and then detect contamination. DICE first identifies the layer most sensitive to contamination, then trains a classifier on that layer's internal states. Experiments show that DICE detects in-distribution contamination with high accuracy across various LLMs and math reasoning datasets. We also demonstrate the generalization capability of the trained DICE detector, which can detect contamination across multiple benchmarks with similar distributions. Additionally, we find that DICE detection scores are positively correlated with the performance of ten LLMs fine-tuned by us or by other organizations on four math reasoning datasets ($R^2$ values between 0.6 and 0.75), indicating that in-distribution contamination potentially leads to an overestimation of the true capabilities of many existing models. The code and data are available at https://github.com/THU-KEG/DICE.
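The locate-then-detect idea can be illustrated with a minimal sketch. Everything below is our assumption rather than the paper's actual implementation: hidden states are simulated with synthetic arrays instead of a real LLM forward pass, and a scikit-learn logistic-regression probe stands in for DICE's classifier. "Locate" picks the layer whose states best separate contaminated from clean examples; "detect" trains the final classifier on that layer.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for per-layer hidden states: (n_layers, n_examples, hidden_dim).
# In the real setting these would come from a forward pass over benchmark-like prompts.
n_layers, n_examples, dim = 4, 200, 32
states = rng.normal(size=(n_layers, n_examples, dim))
labels = rng.integers(0, 2, size=n_examples)  # 1 = contaminated, 0 = clean

# Plant a contamination signal in layer 2 so "locate" has something to find.
states[2, labels == 1] += 1.0

def locate_sensitive_layer(states, labels):
    """Return the index of the layer whose states best separate contaminated
    from clean examples, measured by cross-validated probe accuracy."""
    scores = [
        cross_val_score(LogisticRegression(max_iter=1000), layer_states, labels, cv=3).mean()
        for layer_states in states
    ]
    return int(np.argmax(scores))

# Locate: find the most contamination-sensitive layer.
layer = locate_sensitive_layer(states, labels)

# Detect: train the contamination classifier on that layer's internal states.
detector = LogisticRegression(max_iter=1000).fit(states[layer], labels)
```

On this toy data the probe singles out the planted layer, mirroring how a real sensitive layer would stand out from layers whose states carry no contamination signal.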