We study the compute efficiency of LLM training under different parameterizations, i.e., rules for adjusting model and optimizer hyperparameters (HPs) as model size changes. Some parameterizations fail to transfer optimal base HPs (such as learning rate) across changes in model depth, forcing practitioners either to re-tune these HPs as they scale up (expensive) or to accept sub-optimal training when re-tuning is prohibitive. Even when parameterizations achieve HP transfer, we develop theory showing they may still operate in the lazy learning regime, where layers learn only features close to their linearization, preventing effective use of depth and nonlinearity. Finally, we identify and adopt a parameterization we call CompleteP, which achieves both depth-wise HP transfer and non-lazy learning in all layers. CompleteP enables a wider range of model width/depth ratios to remain compute-efficient, unlocking shapes better suited to different hardware settings and operational contexts. Moreover, CompleteP delivers 12-34% compute efficiency improvements over the prior state of the art. All experiments were run on Cerebras CS-3 systems. A minimal implementation is available at https://github.com/EleutherAI/nanoGPT-mup/tree/completep.
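To make the notion of a "parameterization" concrete, the following is a minimal, hypothetical sketch of a depth-wise scaling rule: each residual branch is multiplied by L^{-alpha}, where L is the number of residual blocks and alpha is a depth-scaling exponent. The function names and the specific exponents shown are illustrative assumptions for exposition, not the paper's definitive rules; consult the linked repository for the actual implementation.

```python
# Hypothetical sketch of a depth-wise parameterization rule.
# Assumption: residual branches are scaled by L^{-alpha}; alpha = 1.0 is
# used here to illustrate a "full" depth scaling in the spirit of the
# CompleteP-style rules discussed above (names are illustrative).

def residual_branch_multiplier(depth: int, alpha: float = 1.0) -> float:
    """Multiplier applied to each residual branch in an L-block network.

    With alpha = 1.0, the total contribution of all L branches to the
    residual stream stays O(1) as depth grows, since L * L^{-1} = 1.
    """
    return depth ** (-alpha)


def residual_stream_scale(depth: int, alpha: float = 1.0) -> float:
    """Rough order-of-magnitude of the accumulated residual stream,
    assuming each branch contributes O(1) before scaling."""
    return depth * residual_branch_multiplier(depth, alpha)


# With alpha = 1.0, the residual stream scale is depth-independent,
# so a base HP tuned at small depth has a chance to transfer to large
# depth without re-tuning; with alpha = 0.0, it grows linearly in L.
for L in (8, 32, 128):
    print(L, residual_branch_multiplier(L), residual_stream_scale(L))
```

This toy calculation only illustrates why the exponent choice interacts with HP transfer; the paper's theory additionally addresses when such rules keep every layer out of the lazy (near-linearized) learning regime.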