We study a fundamental problem in the evaluation of large language models that we call training on the test task. Unlike wrongful practices such as training on the test data, leakage, or data contamination, training on the test task is not a malpractice. Rather, the term describes a growing set of practices that utilize knowledge about evaluation tasks at training time. We demonstrate that training on the test task confounds both relative model evaluations and claims about emergent capabilities. We argue that the seeming superiority of one model family over another may be explained by a different degree of training on the test task. To this end, we propose an effective method to adjust benchmark evaluations for the effect of training on the test task: put simply, fine-tune each model under comparison on the same task-relevant data before evaluation. We then show that instances of emergent behavior disappear gradually as models train on the test task. Our work promotes a new perspective on the evaluation of large language models, with broad implications for benchmarking and the study of emergent capabilities.
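As a concrete illustration of the proposed adjustment, here is a minimal sketch that fine-tunes every model under comparison on the same task-relevant data before evaluation. It assumes the Hugging Face transformers and datasets libraries; the model identifiers, the file `task_relevant_data.jsonl`, and the training hyperparameters are hypothetical placeholders, not the paper's actual experimental setup.

```python
# Minimal sketch: apply the same task-relevant fine-tuning to each model
# before benchmarking, so differences in prior training on the test task
# are (approximately) equalized. All names below are illustrative.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

def finetune_on_task_data(model_name: str, data_path: str, output_dir: str):
    """Fine-tune one model on the shared task-relevant data before evaluation."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # The same task-relevant corpus is used for every model under comparison.
    dataset = load_dataset("json", data_files=data_path, split="train")
    dataset = dataset.map(
        lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
        batched=True,
        remove_columns=dataset.column_names,
    )

    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir=output_dir,
            num_train_epochs=1,          # placeholder hyperparameters
            per_device_train_batch_size=4,
        ),
        train_dataset=dataset,
        # Causal LM objective: the collator copies input_ids into labels.
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()
    return trainer.model  # evaluate this adjusted model on the benchmark

# Apply the identical adjustment to each model family being compared.
for name in ["model-family-a", "model-family-b"]:  # hypothetical identifiers
    finetune_on_task_data(name, "task_relevant_data.jsonl", f"adjusted/{name}")
```

The key design point is symmetry: because every model receives the same task-relevant fine-tuning, any remaining benchmark gap is less likely to reflect differing amounts of prior training on the test task.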