Optimal configuration of the learning rate (LR) is a fundamental yet formidable challenge in large-scale pre-training. Given the tight trade-off between training cost and model performance, the pivotal question is whether the optimal LR can be accurately extrapolated from low-cost experiments. In this paper, we formalize this investigation into two distinct research paradigms: Fitting and Transfer. Within the Fitting Paradigm, we introduce a scaling law for the search factor, reducing the search complexity from $O(n^3)$ to $O(n \cdot C_D \cdot C_\eta)$ via predictive modeling. Within the Transfer Paradigm, we extend the principles of $\mu$Transfer to the Mixture-of-Experts (MoE) architecture, broadening its applicability to encompass model depth, weight decay, and token horizons. By pushing existing hyperparameter research to substantially larger scales, we conduct a comprehensive comparison between the two paradigms. Our empirical results challenge the scalability of the widely adopted $\mu$Transfer in large-scale pre-training scenarios. Furthermore, we provide a rigorous analysis through the dual lenses of training stability and feature learning, elucidating why module-wise parameter tuning underperforms at large scale. This work offers systematic practical guidelines and a fresh theoretical perspective for optimizing industrial-scale pre-training.
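To make the complexity claim concrete, one possible reading (a sketch only; the abstract does not specify the functional form, and the power-law ansatz and symbols $N$, $D$, $\alpha$, $\beta$, $c$ below are illustrative assumptions rather than the paper's stated model) is that, instead of jointly sweeping $n$ model sizes, $n$ data sizes, and $n$ candidate LRs at a cost of $O(n^3)$ runs, the Fitting Paradigm fits a predictive model of the optimal LR from only $C_D$ token horizons and $C_\eta$ candidate LRs per model size, e.g.
\[
\eta^{*}(N, D) \;\approx\; c \, N^{-\alpha} D^{-\beta},
\]
and then extrapolates to the target scale, giving an overall search cost of $O(n \cdot C_D \cdot C_\eta)$ low-cost runs plus the fit itself.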