We find that the cross-entropy loss curves of neural language models empirically adhere to a scaling law with learning rate (LR) annealing over training steps ($s$): $$L(s) = L_0 + A\cdot S_1^{-\alpha} - C\cdot S_2$$ where $S_1$ is the forward area and $S_2$ is the LR annealing area. This formulation accounts for two factors: (1) forward scaling, as defined by the typical scaling law, and (2) the additional loss drop brought by LR annealing. Therefore, this formulation describes the full loss curve at every step, rather than only the single loss point at the end of training. By applying the scaling law with LR annealing and fitting only one or two training curves, we can accurately predict the loss of language model training at any given step and under any learning rate scheduler (LRS). Furthermore, this equation accurately describes the dynamics during the training process, and provides theoretical verification and explanation for numerous experimental findings of previous studies, particularly those focusing on LR schedules and LR annealing. The resulting insights also serve as a guide for researchers to select critical LRSs in advance by prediction with our equation. Most significantly, since all points in a full training curve follow the equation, we can achieve accurate loss prediction at any given step under any LRS while expending less than 1\% of the computational cost required by the Chinchilla scaling law to fit language modeling loss. This approach greatly democratizes scaling-law fitting and prediction in the development of large language models.
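The equation above can be evaluated numerically once the LR schedule is known. Below is a minimal sketch in Python: $S_1$ is taken as the cumulative sum of the learning rate (the "forward area"), and $S_2$ is simplified to the cumulative amount of LR decrease (the full formulation also involves a momentum-decay term, omitted here). The schedule, fitted constants `L0`, `A`, `C`, `alpha`, and the helper names are all illustrative assumptions, not values from the paper.

```python
import numpy as np

def lr_schedule(total_steps, peak_lr=3e-4, warmup=500):
    # Illustrative LRS: linear warmup followed by cosine annealing to zero.
    steps = np.arange(1, total_steps + 1)
    warm = np.minimum(steps / warmup, 1.0)
    cos = 0.5 * (1.0 + np.cos(np.pi * np.maximum(steps - warmup, 0)
                              / (total_steps - warmup)))
    return peak_lr * warm * cos

def predicted_loss(lrs, L0, A, C, alpha):
    # S1 ("forward area"): cumulative sum of the learning rate over steps.
    S1 = np.cumsum(lrs)
    # S2 ("annealing area"), simplified: cumulative LR decrease so far.
    # The paper's full form additionally weights drops by a momentum decay.
    drops = np.maximum(np.concatenate([[0.0], lrs[:-1] - lrs[1:]]), 0.0)
    S2 = np.cumsum(drops)
    # Loss at every step s, not just at the end of training.
    return L0 + A * S1 ** (-alpha) - C * S2

lrs = lr_schedule(10_000)
# Constants below are placeholders; in practice they are fitted from
# one or two observed training curves.
loss = predicted_loss(lrs, L0=2.0, A=0.5, C=1.5, alpha=0.5)
```

With constants fitted from a single observed run, the same `predicted_loss` call can be re-evaluated on a different candidate schedule to compare LRS choices before launching a full training run.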