We study a fundamental problem in the evaluation of large language models that we call training on the test task. Unlike wrongful practices such as training on the test data, leakage, or data contamination, training on the test task is not a malpractice. Rather, the term describes a growing set of techniques to include task-relevant data in the pretraining stage of a language model. We demonstrate that training on the test task confounds both relative model evaluations and claims about emergent capabilities. We argue that the seeming superiority of one model family over another may be explained by a different degree of training on the test task. To this end, we propose an effective method to adjust for training on the test task by fine-tuning each model under comparison on the same task-relevant data before evaluation. We then show that instances of emergent behavior largely vanish once we adjust for training on the test task. This also applies to reported instances of emergent behavior that cannot be explained by the choice of evaluation metric. Our work promotes a new perspective on the evaluation of large language models with broad implications for benchmarking and the study of emergent capabilities.