The impressive capabilities of Large Language Models (LLMs) across diverse tasks are now well established, yet their effective deployment requires careful hyperparameter optimization. Through extensive empirical studies involving grid searches across diverse configurations, we discover universal scaling laws governing these hyperparameters: the optimal learning rate follows a power-law relationship with both model parameter count and data size, while the optimal batch size scales primarily with data size. Our analysis reveals a convex hyperparameter optimization landscape under fixed model and data size conditions, implying the existence of an optimal hyperparameter plateau. We contribute a universal, plug-and-play optimal-hyperparameter tool for the community; its estimated values on the test set deviate by merely 0.07\% from the globally optimal LLM performance found via exhaustive search. These laws demonstrate remarkable robustness across variations in model sparsity, training data distribution, and model shape. To the best of our knowledge, this is the first work to unify different model shapes and structures, such as Mixture-of-Experts models and dense transformers, and to establish optimal hyperparameter scaling laws across diverse data distributions. This exhaustive optimization process demanded substantial computational resources: nearly one million NVIDIA H800 GPU hours to train 3,700 LLMs of varying sizes and hyperparameters from scratch, consuming approximately 100 trillion tokens in total. To facilitate reproducibility and further research, we will progressively release all loss measurements and model checkpoints through our designated repository: https://step-law.github.io/