Language models scale reliably with over-training and on downstream tasks

Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, Rui Xin, Marianna Nezhurina, Igor Vasiljevic, Jenia Jitsev, Alexandros G. Dimakis, Gabriel Ilharco, Shuran Song, Thomas Kollar, Yair Carmon, Achal Dave, Reinhard Heckel, Niklas Muennighoff, Ludwig Schmidt

Scaling laws are useful guides for developing language models, but there are still gaps between current scaling studies and how language models are ultimately trained and evaluated. For instance, scaling is usually studied in the compute-optimal training regime (i.e., "Chinchilla optimal" regime); however, in practice, models are often over-trained to reduce inference costs. Moreover, scaling laws mostly predict loss on next-token prediction, but ultimately models are compared based on downstream task performance. In this paper, we address both shortcomings. To do so, we create a testbed of 104 models with 0.011B to 6.9B parameters trained with various numbers of tokens on three data distributions. First, we investigate scaling in the over-trained regime. We fit scaling laws that extrapolate in both the number of model parameters and the ratio of training tokens to parameters. This enables us to predict the validation loss of a 1.4B parameter, 900B token run (i.e., 32$\times$ over-trained) and a 6.9B parameter, 138B token run, each from experiments that take 300$\times$ less compute. Second, we relate the perplexity of a language model to its downstream task performance via a power law. We use this law to predict top-1 error averaged over downstream tasks for the two aforementioned models using experiments that take 20$\times$ less compute. Our experiments are available at https://github.com/mlfoundations/scaling.
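The "32$\times$ over-trained" figure can be checked with simple arithmetic. The sketch below is illustrative only and is not taken from the released code. It assumes that the compute-optimal ("Chinchilla optimal") ratio is roughly 20 training tokens per parameter and that training compute is approximated by the common rule of thumb of 6 FLOPs per parameter per token; neither value is stated in the abstract, and the function names are hypothetical.

```python
# Illustrative arithmetic only; the constants below are assumptions,
# not values stated in the abstract.
CHINCHILLA_TOKENS_PER_PARAM = 20.0  # assumed compute-optimal token-to-parameter ratio
FLOPS_PER_PARAM_PER_TOKEN = 6.0     # assumed rule of thumb: compute ~ 6 * N * D


def token_multiplier(num_params: float, num_tokens: float) -> float:
    """Ratio of training tokens to model parameters."""
    return num_tokens / num_params


def overtraining_factor(num_params: float, num_tokens: float) -> float:
    """Token multiplier relative to the assumed compute-optimal ratio."""
    return token_multiplier(num_params, num_tokens) / CHINCHILLA_TOKENS_PER_PARAM


def approx_train_flops(num_params: float, num_tokens: float) -> float:
    """Approximate training compute in FLOPs under the 6 * N * D rule of thumb."""
    return FLOPS_PER_PARAM_PER_TOKEN * num_params * num_tokens


if __name__ == "__main__":
    # The two target runs named in the abstract: (parameters, training tokens).
    runs = {
        "1.4B params, 900B tokens": (1.4e9, 900e9),
        "6.9B params, 138B tokens": (6.9e9, 138e9),
    }
    for name, (n, d) in runs.items():
        print(
            f"{name}: "
            f"tokens/param = {token_multiplier(n, d):.1f}, "
            f"over-training factor = {overtraining_factor(n, d):.1f}, "
            f"approx. FLOPs = {approx_train_flops(n, d):.2e}"
        )
```

Under these assumptions, the 1.4B run has a token-to-parameter ratio about 32 times the assumed compute-optimal ratio, which matches the figure quoted above, while the 6.9B run sits at approximately the assumed compute-optimal ratio.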
