We develop task scaling laws and model ladders to predict the individual task performance of pretrained language models (LMs) in the overtrained setting. Standard power laws for language modeling loss cannot accurately model task performance. We therefore use a two-step prediction approach: first, model and data size are used to predict a task-specific loss, and then this task loss is used to predict task performance. We train a set of small-scale "ladder" models, collect data points to fit the parameterized functions of the two prediction steps, and make predictions for two target models: a 7B model trained to 4T tokens and a 13B model trained to 5T tokens. Training the ladder models costs only 1% of the compute used for the target models. On four multiple-choice tasks written in ranked classification format, we can predict the accuracy of both target models within 2 points of absolute error. On four other tasks we observe higher prediction error (an average absolute error of 6.9 points) and find that these are often tasks with higher variance in task metrics. We also find that using less compute to train fewer ladder models tends to degrade predictions. Finally, we empirically show that our design choices and the two-step approach yield superior performance in establishing scaling laws.
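The two-step pipeline can be sketched as a pair of fitted parameterized functions chained together. In this minimal sketch, step 1 uses a Chinchilla-style power law in model size N and token count D, and step 2 maps task loss to accuracy through a sigmoid; both functional forms, the ladder configurations, and every constant below are illustrative assumptions fit on synthetic data, not the paper's exact parameterization or results.

```python
import numpy as np
from scipy.optimize import curve_fit

# Step 1: predict task loss from model size N and data size D with a
# Chinchilla-style power law (an assumed form, not the paper's exact one).
def task_loss(X, E, A, alpha, B, beta):
    N, D = X
    return E + A / N**alpha + B / D**beta

# Step 2: map task loss to accuracy with a sigmoid between a random-chance
# floor p_min and a ceiling p_max (also an assumed parameterization).
def task_accuracy(L, p_min, p_max, k, L0):
    return p_min + (p_max - p_min) / (1.0 + np.exp(k * (L - L0)))

# Hypothetical "ladder" runs: four small model sizes, each trained on
# several token budgets (multiples of a ~20x-params token budget).
sizes = np.array([190e6, 370e6, 760e6, 1e9])
mults = np.array([1.0, 2.0, 5.0, 10.0])
N = np.repeat(sizes, len(mults))
D = 20.0 * N * np.tile(mults, len(sizes))

# Ground-truth parameters used here only to generate noiseless toy data.
L_obs = task_loss((N, D), 0.5, 400.0, 0.3, 600.0, 0.3)
acc_obs = task_accuracy(L_obs, 0.25, 0.9, 4.0, 1.8)

# Fit each step's parameterized function to the ladder data points.
p1, _ = curve_fit(task_loss, (N, D), L_obs,
                  p0=[0.4, 300.0, 0.25, 500.0, 0.25], maxfev=20000)
p2, _ = curve_fit(task_accuracy, L_obs, acc_obs,
                  p0=[0.25, 0.9, 3.0, 1.5], maxfev=20000)

# Chain the two steps to extrapolate to a 7B-parameter / 4T-token target.
L_target = task_loss((7e9, 4e12), *p1)
acc_target = task_accuracy(L_target, *p2)
print(f"predicted task loss {L_target:.3f}, accuracy {acc_target:.3f}")
```

Fitting the two steps separately lets step 2 absorb the nonlinearity between loss and accuracy (e.g. the random-chance floor of multiple-choice tasks) that a single loss power law cannot capture.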