We study a fundamental problem in the evaluation of large language models that we call training on the test task. Unlike wrongful practices such as training on the test data, leakage, or data contamination, training on the test task is not a malpractice. Rather, the term describes a growing set of practices that utilize knowledge about evaluation tasks at training time. We demonstrate that training on the test task confounds both relative model evaluations and claims about emergent capabilities. We argue that the seeming superiority of one model family over another may be explained by a different degree of training on the test task. To this end, we propose an effective method to adjust benchmark evaluations for the effect of training on the test task: put simply, fine-tune each model under comparison on the same task-relevant data before evaluation. We then show that instances of emergent behavior disappear gradually as models train on the test task. Our work promotes a new perspective on the evaluation of large language models, with broad implications for benchmarking and the study of emergent capabilities.
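As a concrete illustration of the proposed adjustment, here is a minimal sketch that fine-tunes every model under comparison on the same task-relevant data before evaluation. It assumes the Hugging Face transformers and datasets libraries; the model identifiers, the file `task_relevant_data.jsonl`, and the training hyperparameters are hypothetical placeholders, not the paper's actual experimental setup.

```python
# Minimal sketch: apply the same task-relevant fine-tuning to each model
# before benchmarking, so differences in prior training on the test task
# are (approximately) equalized. All names below are illustrative.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

def finetune_on_task_data(model_name: str, data_path: str, output_dir: str):
    """Fine-tune one model on the shared task-relevant data before evaluation."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # The same task-relevant corpus is used for every model under comparison.
    dataset = load_dataset("json", data_files=data_path, split="train")
    dataset = dataset.map(
        lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
        batched=True,
        remove_columns=dataset.column_names,
    )

    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir=output_dir,
            num_train_epochs=1,          # placeholder hyperparameters
            per_device_train_batch_size=4,
        ),
        train_dataset=dataset,
        # Causal LM objective: the collator copies input_ids into labels.
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()
    return trainer.model  # evaluate this adjusted model on the benchmark

# Apply the identical adjustment to each model family being compared.
for name in ["model-family-a", "model-family-b"]:  # hypothetical identifiers
    finetune_on_task_data(name, "task_relevant_data.jsonl", f"adjusted/{name}")
```

The key design point is symmetry: because every model receives the same task-relevant fine-tuning, any remaining benchmark gap is less likely to reflect differing amounts of prior training on the test task.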