This work studies general principles for improving the learning of language models (LMs), with the aim of reducing the number of training steps needed to reach superior performance. Specifically, we present a theory for the optimal learning of LMs. We first propose an objective that optimizes LM learning by maximizing the data compression ratio under an "LM-training-as-lossless-compression" view. We then derive a theorem, named Learning Law, that reveals the properties of the dynamics in the optimal learning process under our objective. The theorem is validated by experiments on a linear classification task and a real-world language modeling task. Finally, we empirically verify that the optimal learning of LMs essentially stems from an improvement of the coefficients in the scaling law of LMs, which indicates great promise and significance for designing practical learning acceleration methods. Our code can be found at https://aka.ms/LearningLaw.
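As a rough illustration of the "LM-training-as-lossless-compression" view, the sketch below assumes one common reading of it: each training batch is encoded with the model's current predictions, so the compressed size is approximately the sum of the per-step losses, and a faster-dropping loss curve yields a higher compression ratio. The loss curves are synthetic, and the names `compression_ratio`, `loss_curve`, `tokens_per_step`, and `bits_per_raw_token` are hypothetical choices made for this example, not taken from the released code.

```python
import math


def compression_ratio(losses_nats, tokens_per_step, bits_per_raw_token):
    """Raw size divided by compressed size for one training run.

    Assumption: the code length of a batch equals its per-token loss
    (converted from nats to bits) times the number of tokens in the batch.
    """
    compressed_bits = sum(
        loss / math.log(2) * tokens_per_step for loss in losses_nats
    )
    raw_bits = len(losses_nats) * tokens_per_step * bits_per_raw_token
    return raw_bits / compressed_bits


def loss_curve(steps, rate, start=4.0, floor=2.0):
    """Synthetic exponentially decaying loss curve (illustrative only)."""
    return [floor + (start - floor) * math.exp(-rate * t) for t in range(steps)]


# Two hypothetical learning processes over the same number of steps.
slow = loss_curve(steps=1000, rate=0.002)
fast = loss_curve(steps=1000, rate=0.01)

# Assumed values: 1024 tokens per step, 16 bits per uncompressed token.
cr_slow = compression_ratio(slow, tokens_per_step=1024, bits_per_raw_token=16)
cr_fast = compression_ratio(fast, tokens_per_step=1024, bits_per_raw_token=16)

print(f"slow learner compression ratio: {cr_slow:.3f}")
print(f"fast learner compression ratio: {cr_fast:.3f}")
```

Under these assumptions, maximizing the compression ratio is the same as minimizing the summed loss over training, which is why the faster-converging curve scores higher.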