While scaling laws provide a reliable methodology for predicting train loss across compute scales for a single data distribution, less is known about how these predictions should change as we change the distribution. In this paper, we derive a strategy for predicting one loss from another and apply it to predict across different pre-training datasets and from pre-training data to downstream task data. Our predictions extrapolate well even at 20x the largest FLOP budget used to fit the curves. More precisely, we find that there are simple shifted power law relationships between (1) the train losses of two models trained on two separate datasets when the models are paired by training compute (train-to-train), (2) the train loss and the test loss on any downstream distribution for a single model (train-to-test), and (3) the test losses of two models trained on two separate train datasets (test-to-test). The results hold up for pre-training datasets that differ substantially (some are entirely code and others have no code at all) and across a variety of downstream tasks. Finally, we find that in some settings these shifted power law relationships can yield more accurate predictions than extrapolating single-dataset scaling laws.
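As a minimal illustration of what a "shifted power law" between losses looks like, the sketch below assumes the parameterization L_B = k * (L_A - E_A)^kappa + E_B, where E_A and E_B are irreducible-loss offsets; this functional form, the offset values, and the synthetic loss pairs are all assumptions for demonstration, not the paper's fitted results. With the offsets fixed, the fit reduces to ordinary least squares in log space.

```python
import math

def fit_shifted_power_law(losses_a, losses_b, e_a, e_b):
    """Fit k, kappa in L_B = k * (L_A - e_a)**kappa + e_b.

    With the offsets e_a, e_b held fixed, taking logs gives a line:
        log(L_B - e_b) = log(k) + kappa * log(L_A - e_a)
    so k and kappa come from ordinary least squares on the logged values.
    """
    xs = [math.log(la - e_a) for la in losses_a]
    ys = [math.log(lb - e_b) for lb in losses_b]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    kappa = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    k = math.exp(my - kappa * mx)
    return k, kappa

# Synthetic check: paired train losses of compute-matched models,
# generated from the assumed form with known parameters.
e_a, e_b, true_k, true_kappa = 1.7, 1.2, 0.8, 1.3
losses_a = [3.5, 3.0, 2.6, 2.3, 2.1, 1.95]
losses_b = [true_k * (la - e_a) ** true_kappa + e_b for la in losses_a]

k, kappa = fit_shifted_power_law(losses_a, losses_b, e_a, e_b)
print(round(k, 3), round(kappa, 3))  # recovers 0.8 and 1.3
```

In practice the offsets would themselves be fit (e.g. by nonlinear least squares over all four parameters); fixing them here just keeps the sketch linear and self-contained.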