The literature on test set contamination largely focuses on detection, while the correction of contaminated test scores remains underexplored. Our core proposal is to spike the training data by intentionally contaminating some test examples at known rates. The spiked examples can then be used to calibrate predictors of model memorization, which enable principled statistical correction of inflated test scores. To evaluate different correction estimators, we first present a simulation framework based on the Hubble models. Hubble models come in minimal pairs: the perturbed model was deliberately contaminated with several test sets, while the standard model was not and thus serves as the counterfactual and correction target. We consider estimators that use information from a memorization predictor, a correctness predictor, or both. In simulation, we establish basic statistical intuitions and show that estimators leveraging memorization and correctness information outperform naive estimation, which makes no correction at all. We then instantiate several memorization and correctness predictors and find that simple predictors, such as Platt-scaled membership inference metrics, provide good signal for correction. Finally, we examine the practical considerations of spiking. Simple memorization predictors need no more than 10 examples for calibration and often transfer from one dataset to another. Taken together, these results suggest that spiking is a promising solution for test set contamination.
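The calibrate-then-correct idea can be illustrated with a small sketch. The abstract does not specify the estimators, so everything below is an assumption made for illustration: the helper names `fit_platt` and `corrected_accuracy` are hypothetical, the membership inference scores are synthetic, and the weighted-average estimator is one simple way to use a memorization predictor, not necessarily one of the estimators studied in the paper. Platt scaling here means fitting a one-dimensional logistic regression that maps a raw score to a probability of memorization, calibrated on examples whose spiked or unspiked status is known.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(scores, labels, lr=0.1, steps=5000, l2=1e-3):
    """Fit p(memorized | score) = sigmoid(a * score + b) by gradient descent.

    A small L2 penalty on `a` keeps the fit finite when the calibration
    data are perfectly separable.
    """
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y
            grad_a += err * s
            grad_b += err
        a -= lr * (grad_a / n + l2 * a)
        b -= lr * (grad_b / n)
    return a, b


def corrected_accuracy(correct, p_mem):
    """Illustrative estimator (an assumption, not the paper's method):
    average correctness, down-weighting examples by their predicted
    probability of having been memorized."""
    weights = [1.0 - p for p in p_mem]
    total = sum(weights)
    if total == 0.0:
        return float("nan")
    return sum(w * c for w, c in zip(weights, correct)) / total


# Synthetic calibration set: 5 spiked (label 1) and 5 unspiked (label 0)
# examples with made-up membership inference scores.
calib_scores = [2.0, 2.5, 3.0, 1.8, 2.2, -2.0, -1.5, -2.5, -1.0, -1.8]
calib_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
a, b = fit_platt(calib_scores, calib_labels)

# Synthetic test set: the first four examples look memorized and are all
# answered correctly; the remaining six look clean and are half correct.
test_scores = [2.1, 2.4, 1.9, 2.6, -1.9, -2.2, -1.4, -1.7, -2.0, -1.2]
test_correct = [1, 1, 1, 1, 1, 0, 1, 0, 1, 0]

p_mem = [sigmoid(a * s + b) for s in test_scores]
naive = sum(test_correct) / len(test_correct)
corrected = corrected_accuracy(test_correct, p_mem)

print(f"naive accuracy:     {naive:.3f}")
print(f"corrected accuracy: {corrected:.3f}")
```

In this toy setup the naive score is inflated by the examples that look memorized, and the weighted estimate moves toward the accuracy on the examples that look clean. A real pipeline would replace the synthetic scores with an actual membership inference metric and would also need the correctness predictor that the abstract mentions.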