Unraveling the Mystery of Scaling Laws: Part I

Scaling laws describe a power-law relationship between loss and variables such as model size, dataset size, and the compute used during training. They play a vital role in optimizing various aspects of model pre-training and have contributed to the success of large language models such as GPT-4, Llama and Gemini. However, the original scaling law paper by OpenAI did not disclose all the details needed to derive the precise scaling-law formulas, and its conclusions are based only on models with up to 1.5 billion parameters. Although some subsequent works attempt to unveil these details and scale to larger models, they often neglect the dependence on important training factors such as the learning rate, context length and batch size, and so fail to establish a reliable formula for predicting the test loss trajectory. In this technical report, we confirm that the scaling law formulations proposed in the original OpenAI paper remain valid when the model size is scaled up to 33 billion parameters, but that the constant coefficients in these formulas vary significantly with the experimental setup. We carefully identify the influential factors and provide transparent, step-by-step instructions for estimating all constant terms in the scaling-law formulas by training models with only 1M to 60M parameters. Using these estimated formulas, we show that various attributes of models with up to 33B parameters can be predicted accurately before training, including (1) the minimum possible test loss; (2) the minimum number of training steps and processed tokens required to reach a specific loss; (3) the critical batch size with an optimal time/computation trade-off at any loss value; and (4) the complete test loss trajectory for an arbitrary batch size.
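
To make the idea of estimating constants from small models concrete, here is a minimal sketch. It assumes the model-size power law from the original OpenAI paper, L(N) = (N_c / N)^alpha, and fits alpha and N_c by linear regression in log-log space. The function names, the chosen constants, and the noise-free synthetic losses are all hypothetical and are not taken from this report; the report's own estimation procedure covers more terms and may differ.

```python
import numpy as np

def fit_power_law(n_params, losses):
    """Fit L(N) = (N_c / N) ** alpha by least squares in log-log space."""
    log_n = np.log(np.asarray(n_params, dtype=float))
    log_l = np.log(np.asarray(losses, dtype=float))
    # log L = alpha * log N_c - alpha * log N, so the slope equals -alpha.
    slope, intercept = np.polyfit(log_n, log_l, 1)
    alpha = -slope
    n_c = np.exp(intercept / alpha)
    return alpha, n_c

def predict_loss(n, alpha, n_c):
    """Extrapolate the fitted law to a model with n parameters."""
    return (n_c / n) ** alpha

# Hypothetical constants, used only to generate synthetic data for the demo.
true_alpha, true_n_c = 0.076, 8.8e13

# Small models in the 1M to 60M parameter range, as in the abstract.
small_models = np.array([1e6, 3e6, 1e7, 3e7, 6e7])
synthetic_losses = (true_n_c / small_models) ** true_alpha  # noise-free

alpha, n_c = fit_power_law(small_models, synthetic_losses)
predicted_33b = predict_loss(33e9, alpha, n_c)
print(f"alpha={alpha:.4f}, N_c={n_c:.3e}, predicted loss at 33B={predicted_33b:.4f}")
```

Because the synthetic data is noise-free, the fit recovers the generating constants almost exactly. With real training runs the losses are noisy and depend on the learning rate, context length and batch size, so the fitted constants are only meaningful for the setup in which they were measured.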
