As large language models achieve impressive scores on traditional benchmarks, a growing number of researchers are becoming concerned about benchmark data leaking into pre-training corpora, commonly known as the data contamination problem. To ensure fair evaluation, recent benchmarks release only the training and validation sets and keep the test-set labels closed-source. Anyone wishing to evaluate a language model must submit the model's predictions for centralized processing, after which the results are published on the benchmark's leaderboard. However, this submission process is inefficient and prevents effective error analysis. To address this issue, we propose to variabilize benchmarks and evaluate language models dynamically. Specifically, we extract variables from each test case and define a value range for each variable. For each evaluation, we sample new values from these ranges to create unique test cases, ensuring a fresh evaluation every time. We apply this variable-perturbation method to four datasets: GSM8K, ARC, CommonsenseQA, and TruthfulQA, which cover mathematical generation and multiple-choice tasks. Our experimental results demonstrate that this approach provides a more accurate assessment of the true capabilities of language models and effectively mitigates the contamination problem.
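As a rough illustration of the variabilization idea, the sketch below instantiates a GSM8K-style arithmetic question from a template: variables extracted from a test case each carry a value range, fresh values are sampled per evaluation, and the ground-truth answer is recomputed from the sampled values. The template, variable names, ranges, and answer function here are hypothetical examples, not the paper's actual data format or pipeline.

```python
import random

def make_gsm8k_style_case(seed=None):
    """Sample fresh variable values to instantiate a unique test case.

    A minimal sketch of variable perturbation under assumed variables
    and ranges; not the benchmark's real schema.
    """
    rng = random.Random(seed)

    # Variables extracted from an original test case, each with a value range.
    variables = {
        "n_apples": rng.randint(3, 20),  # apples bought per day
        "n_days": rng.randint(2, 10),    # number of days
        "price": rng.randint(1, 5),      # price per apple in dollars
    }

    # Question template with placeholders for the extracted variables.
    question = (
        "A shopper buys {n_apples} apples every day for {n_days} days. "
        "Each apple costs ${price}. How much is spent in total?"
    ).format(**variables)

    # Ground-truth label recomputed from the sampled values, so every
    # instantiation remains internally consistent.
    answer = variables["n_apples"] * variables["n_days"] * variables["price"]
    return question, answer

if __name__ == "__main__":
    q, a = make_gsm8k_style_case(seed=0)
    print(q)
    print("Expected answer:", a)
```

Because each evaluation run samples different values, a model cannot benefit from having memorized the original test instance, which is the intuition behind the dynamic evaluation described above.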