Large language models are trained on vast amounts of internet data, prompting concerns and speculation that they have memorized public benchmarks. Going from speculation to proof of contamination is challenging, as the pretraining data used by proprietary models are often not publicly accessible. We show that it is possible to provide provable guarantees of test set contamination in language models without access to pretraining data or model weights. Our approach leverages the fact that when there is no data contamination, all orderings of an exchangeable benchmark should be equally likely. In contrast, the tendency for language models to memorize example order means that a contaminated language model will find certain canonical orderings to be much more likely than others. Our test flags potential contamination whenever the likelihood of a canonically ordered benchmark dataset is significantly higher than the likelihood after shuffling the examples. We demonstrate that our procedure is sensitive enough to reliably prove test set contamination in challenging situations, including models as small as 1.4 billion parameters, on small test sets of only 1000 examples, and datasets that appear only a few times in the pretraining corpus. Using our test, we audit five popular publicly accessible language models for test set contamination and find little evidence for pervasive contamination.
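The comparison between canonical and shuffled orderings can be illustrated with a simple Monte Carlo permutation test. The sketch below is a minimal illustration under stated assumptions, not necessarily the exact procedure used in the paper: `permutation_p_value`, `make_memorizing_scorer`, and `order_blind_scorer` are hypothetical names, and the two toy scorers stand in for a real language model's log-likelihood of a concatenated benchmark. Under exchangeability, the canonical ordering is just one of many equally likely orderings, so the fraction of shuffles scoring at least as high as the canonical one is a valid p-value.

```python
import random
from typing import Callable, List, Sequence


def permutation_p_value(
    examples: Sequence[str],
    log_prob: Callable[[Sequence[str]], float],
    num_shuffles: int = 99,
    seed: int = 0,
) -> float:
    """Monte Carlo permutation p-value for 'canonical order is not special'.

    `log_prob` is an assumed callable that returns the model's
    log-likelihood of the examples in the given order.
    """
    rng = random.Random(seed)
    canonical_score = log_prob(examples)
    at_least_as_high = 0
    for _ in range(num_shuffles):
        shuffled: List[str] = list(examples)
        rng.shuffle(shuffled)
        if log_prob(shuffled) >= canonical_score:
            at_least_as_high += 1
    # The +1 terms count the canonical ordering itself, which keeps the
    # p-value valid (never exactly zero) under the null hypothesis.
    return (1 + at_least_as_high) / (1 + num_shuffles)


def make_memorizing_scorer(canonical: Sequence[str]) -> Callable[[Sequence[str]], float]:
    """Toy stand-in for a contaminated model: rewards adjacent pairs
    that appear in the same order as in the canonical benchmark."""
    seen_pairs = set(zip(canonical, canonical[1:]))

    def score(ordering: Sequence[str]) -> float:
        return float(sum(pair in seen_pairs for pair in zip(ordering, ordering[1:])))

    return score


def order_blind_scorer(ordering: Sequence[str]) -> float:
    """Toy stand-in for an uncontaminated model: the score depends only
    on which examples are present, not on their order."""
    return -float(sum(len(example) for example in ordering))


benchmark = [f"example {i}" for i in range(50)]
p_contaminated = permutation_p_value(benchmark, make_memorizing_scorer(benchmark))
p_clean = permutation_p_value(benchmark, order_blind_scorer)
print(f"memorizing scorer p-value: {p_contaminated:.2f}")
print(f"order-blind scorer p-value: {p_clean:.2f}")
```

With the memorizing scorer, no shuffle matches the canonical score, so the p-value sits at its minimum of 1 / (1 + `num_shuffles`); with the order-blind scorer, every shuffle ties and the p-value is 1. In a real audit, `log_prob` would query the model under test, and the number of shuffles bounds how small the p-value can get.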