Unraveling the Mystery of Scaling Laws: Part I

Scaling laws describe a power-law relationship between the loss and variables such as model size, dataset size, and the compute used during training. They play a vital role in optimizing many aspects of model pre-training and have contributed to the success of large language models such as GPT-4, Llama, and Gemini. However, the original scaling law paper by OpenAI did not disclose all the details needed to derive the precise scaling-law formulas, and its conclusions are based only on models with up to 1.5 billion parameters. Although some subsequent works attempt to unveil these details and scale to larger models, they often neglect the dependence on important training factors such as the learning rate, context length, and batch size, and therefore fail to establish a reliable formula for predicting the test loss trajectory. In this technical report, we confirm that the scaling law formulations proposed in the original OpenAI paper remain valid when the model size is scaled up to 33 billion parameters, but that the constant coefficients in these formulas vary significantly with the experimental setup. We identify the influential factors and provide transparent, step-by-step instructions for estimating all constant terms in the scaling-law formulas by training models with only 1M to 60M parameters. Using the estimated formulas, we show that various attributes of models with up to 33B parameters can be predicted accurately before training, including (1) the minimum possible test loss; (2) the minimum number of training steps and processed tokens required to reach a specific loss; (3) the critical batch size with an optimal time/computation trade-off at any loss value; and (4) the complete test loss trajectory for an arbitrary batch size.

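The workflow summarized above (fit the constants on small models, then extrapolate) can be illustrated with a minimal Python sketch. It assumes the functional forms from the original OpenAI scaling law paper: L(N) = (N_c / N)^alpha_N for the converged loss, B_crit(L) = B* / L^(1/alpha_B) for the critical batch size, and S_min = S / (1 + B_crit / B) for the minimum number of steps. The function names and every numeric constant below are hypothetical placeholders chosen for illustration; they are not values from this report, and the synthetic "measurements" are generated from the assumed law rather than from real training runs.

```python
import numpy as np


def fit_model_size_law(n_params, losses):
    """Fit L(N) = (N_c / N) ** alpha_N by least squares in log-log space.

    Taking logs gives log L = alpha_N * log N_c - alpha_N * log N,
    which is a straight line in (log N, log L).
    """
    slope, intercept = np.polyfit(np.log(n_params), np.log(losses), 1)
    alpha_n = -slope
    n_c = np.exp(intercept / alpha_n)
    return alpha_n, n_c


def predict_loss(n_params, alpha_n, n_c):
    """Converged test loss predicted for a model with n_params parameters."""
    return (n_c / n_params) ** alpha_n


def critical_batch_size(loss, b_star, alpha_b):
    """B_crit(L) = B* / L ** (1 / alpha_B)."""
    return b_star / loss ** (1.0 / alpha_b)


def min_steps(steps, batch_size, b_crit):
    """S_min = S / (1 + B_crit / B): steps needed in the large-batch limit."""
    return steps / (1.0 + b_crit / batch_size)


# Hypothetical constants, used only to generate synthetic data for the demo.
TRUE_ALPHA_N = 0.08
TRUE_N_C = 1.0e14

# Small model sizes in the 1M to 60M range, as in the text.
small_sizes = np.array([1e6, 3e6, 1e7, 3e7, 6e7])
synthetic_losses = predict_loss(small_sizes, TRUE_ALPHA_N, TRUE_N_C)

alpha_n, n_c = fit_model_size_law(small_sizes, synthetic_losses)
loss_33b = predict_loss(33e9, alpha_n, n_c)
print(f"alpha_N = {alpha_n:.4f}, N_c = {n_c:.3e}")
print(f"extrapolated loss at 33B parameters: {loss_33b:.4f}")
```

With real measurements the losses would be noisy, so the fit would only approximate the underlying constants, and, as the report stresses, the constants themselves would change with the learning rate schedule, context length, and batch size used.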