As frontier AI systems are pretrained on web-scale data, test set contamination has become a critical concern for accurately assessing their capabilities. While research has thoroughly investigated the impact of test set contamination on discriminative evaluations like multiple-choice question-answering, comparatively little research has studied its impact on generative evaluations. In this work, we quantitatively assess the effect of test set contamination on generative evaluations across the language model lifecycle. We pretrain language models on mixtures of web data and the MATH benchmark, sweeping over model size and the number of test set replicas contaminating the pretraining corpus; performance improves with both contamination and model size. Using scaling laws, we make a surprising discovery: including even a single test set replica enables models to achieve lower loss than the irreducible error of training on the uncontaminated corpus. We then study further training: overtraining with fresh data reduces the effects of contamination, whereas supervised finetuning on the training set can either increase or decrease performance on test data, depending on the amount of pretraining contamination. Finally, at inference, we identify factors that modulate memorization: high sampling temperatures mitigate contamination effects, and longer solutions are exponentially more difficult to memorize than shorter ones, presenting a contrast with discriminative evaluations, where solutions are only a few tokens in length. By characterizing how generation and memorization interact, we highlight a new layer of complexity for trustworthy evaluation of AI systems.
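The claim that longer solutions are exponentially harder to memorize can be illustrated with a toy independence model. This is a hedged sketch, not the paper's analysis: it assumes (hypothetically) that a contaminated model reproduces each token of a memorized solution independently with some per-token probability `p`, so the chance of emitting an `L`-token solution verbatim is `p**L`, which decays exponentially in `L`.

```python
# Toy model: verbatim-memorization probability vs. solution length.
# Assumption (illustrative, not from the paper): each token of a memorized
# solution is reproduced independently with probability p_token, so an
# exact L-token reproduction occurs with probability p_token ** L.

def exact_match_prob(p_token: float, length: int) -> float:
    """Probability of reproducing a length-token solution verbatim."""
    return p_token ** length

# A discriminative-style answer is only a few tokens long, while a
# generative MATH solution may span hundreds of tokens.
short_answer = exact_match_prob(0.99, 4)    # roughly 0.96
long_solution = exact_match_prob(0.99, 200) # roughly 0.13
```

Under this simple assumption, even a model that reproduces each memorized token with 99% reliability rarely emits a 200-token solution verbatim, consistent with the abstract's contrast between generative and discriminative evaluations.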