We find that the cross-entropy loss curves of neural language models empirically adhere to a scaling law with learning rate (LR) annealing over training steps ($s$): $$L(s) = L_0 + A\cdot S_1^{-\alpha} - C\cdot S_2$$ where $S_1$ is the forward area and $S_2$ is the LR annealing area. This formulation accounts for two factors: (1) forward scaling, as defined by the typical scaling law, and (2) the additional loss drop brought by LR annealing. Therefore, this formulation describes the full loss curve at every step, rather than only the single loss point at the end of training. By applying the scaling law with LR annealing and fitting only one or two training curves, we can accurately predict the loss of language model training at any given step and under any learning rate scheduler (LRS). Furthermore, this equation accurately describes the dynamics during the training process, and provides theoretical verification and explanation for numerous experimental findings of previous studies, particularly those focusing on LR schedules and LR annealing. The resulting insights also serve as a guide for researchers to select critical LRSs in advance by prediction with our equation. Most significantly, since all points in a full training curve follow the equation, we can achieve accurate loss prediction at any given step under any LRS while expending less than 1\% of the computational cost required by the Chinchilla scaling law to fit language modeling loss. This approach greatly democratizes scaling-law fitting and prediction in the development of large language models.
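The equation above can be evaluated numerically once the LR schedule is known. Below is a minimal sketch in Python: $S_1$ is taken as the cumulative sum of the learning rate (the "forward area"), and $S_2$ is simplified to the cumulative amount of LR decrease (the full formulation also involves a momentum-decay term, omitted here). The schedule, fitted constants `L0`, `A`, `C`, `alpha`, and the helper names are all illustrative assumptions, not values from the paper.

```python
import numpy as np

def lr_schedule(total_steps, peak_lr=3e-4, warmup=500):
    # Illustrative LRS: linear warmup followed by cosine annealing to zero.
    steps = np.arange(1, total_steps + 1)
    warm = np.minimum(steps / warmup, 1.0)
    cos = 0.5 * (1.0 + np.cos(np.pi * np.maximum(steps - warmup, 0)
                              / (total_steps - warmup)))
    return peak_lr * warm * cos

def predicted_loss(lrs, L0, A, C, alpha):
    # S1 ("forward area"): cumulative sum of the learning rate over steps.
    S1 = np.cumsum(lrs)
    # S2 ("annealing area"), simplified: cumulative LR decrease so far.
    # The paper's full form additionally weights drops by a momentum decay.
    drops = np.maximum(np.concatenate([[0.0], lrs[:-1] - lrs[1:]]), 0.0)
    S2 = np.cumsum(drops)
    # Loss at every step s, not just at the end of training.
    return L0 + A * S1 ** (-alpha) - C * S2

lrs = lr_schedule(10_000)
# Constants below are placeholders; in practice they are fitted from
# one or two observed training curves.
loss = predicted_loss(lrs, L0=2.0, A=0.5, C=1.5, alpha=0.5)
```

With constants fitted from a single observed run, the same `predicted_loss` call can be re-evaluated on a different candidate schedule to compare LRS choices before launching a full training run.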