Large language models (LLMs) are trained on vast amounts of data, which may unintentionally or intentionally include data from commonly used benchmarks. Such contamination can produce inflated scores on model leaderboards yet disappointing performance in real-world applications. To address this benchmark contamination problem, we first propose a set of requirements that practical contamination detection methods should satisfy. Following these requirements, we introduce PaCoST, a Paired Confidence Significance Testing method for detecting benchmark contamination in LLMs. Our method constructs, for each benchmark instance, a counterpart drawn from the same distribution, and performs a statistical analysis of the model's confidence on each pair to test whether the model is significantly more confident on the original benchmark. We validate the effectiveness of PaCoST and apply it to popular open-source models and benchmarks. We find that almost all of the models and benchmarks we tested show signs of contamination to some degree. Finally, we call for new LLM evaluation methods.
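The paired test described in the abstract can be illustrated with a minimal sketch. It assumes that a confidence score has already been obtained for each original benchmark item and for its same-distribution counterpart. The function name `paired_confidence_test`, its parameters, and the choice of a one-sided sign-flip permutation test are illustrative assumptions, not the procedure specified by PaCoST; how the counterparts are built and how confidence is elicited from the model are outside the scope of this sketch.

```python
import random


def paired_confidence_test(orig_conf, counterpart_conf, n_resamples=10000, seed=0):
    """One-sided paired sign-flip permutation test (illustrative only).

    Null hypothesis: the model is not more confident on the original items
    than on their counterparts. Returns (mean_difference, p_value).
    """
    if len(orig_conf) != len(counterpart_conf):
        raise ValueError("confidence lists must be paired and of equal length")
    if not orig_conf:
        raise ValueError("at least one pair is required")

    # Per-pair difference: original confidence minus counterpart confidence.
    diffs = [o - c for o, c in zip(orig_conf, counterpart_conf)]
    n = len(diffs)
    observed = sum(diffs) / n

    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    hits = 0
    for _ in range(n_resamples):
        # Under the null, the sign of each paired difference is exchangeable.
        resampled = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        if resampled >= observed:
            hits += 1

    # Add-one correction keeps the Monte Carlo p-value strictly positive.
    p_value = (hits + 1) / (n_resamples + 1)
    return observed, p_value


# Hypothetical confidence scores, invented purely for illustration.
original = [0.95, 0.91, 0.88, 0.97, 0.93, 0.90, 0.96, 0.92]
counterpart = [0.71, 0.80, 0.65, 0.78, 0.74, 0.69, 0.81, 0.70]
mean_diff, p = paired_confidence_test(original, counterpart)
print(f"mean difference = {mean_diff:.3f}, one-sided p = {p:.4f}")
```

A small p-value here would only indicate that the model is significantly more confident on the original items than on their counterparts. Whether that counts as evidence of contamination depends on the counterparts truly sharing the originals' distribution, which this sketch takes for granted.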