We study a fundamental problem in the evaluation of large language models that we call training on the test task. Unlike wrongful practices such as training on the test data, leakage, or data contamination, training on the test task is not a malpractice. Rather, the term describes a growing set of techniques to include task-relevant data in the pretraining stage of a language model. We demonstrate that training on the test task confounds both relative model evaluations and claims about emergent capabilities. We argue that the seeming superiority of one model family over another may be explained by a different degree of training on the test task. To this end, we propose an effective method to adjust for training on the test task by fine-tuning each model under comparison on the same task-relevant data before evaluation. We then show that instances of emergent behavior largely vanish once we adjust for training on the test task. This also applies to reported instances of emergent behavior that cannot be explained by the choice of evaluation metric. Our work promotes a new perspective on the evaluation of large language models with broad implications for benchmarking and the study of emergent capabilities.