Hyperparameter tuning can dramatically impact the training stability and final performance of large-scale models. Recent works on neural network parameterisations, such as $\mu$P, have enabled the transfer of optimal global hyperparameters across model sizes. These works propose an empirical practice of searching for optimal global base hyperparameters at a small model size and transferring them to a larger one. We extend these works in two key ways. First, to handle scaling along the most important axes, we propose the Complete$^{(d)}$ Parameterisation, which unifies scaling in width and depth (via an adaptation of CompleteP) as well as in batch size and training duration. Second, with our parameterisation, we investigate per-module hyperparameter optimisation and transfer. We characterise the empirical challenges of navigating the high-dimensional hyperparameter landscape and propose practical guidelines for tackling this optimisation problem. We demonstrate that, with the right parameterisation, hyperparameter transfer holds even in the per-module regime. Our study covers an extensive range of optimisation hyperparameters of modern models: learning rates, AdamW parameters, weight decay, initialisation scales, and residual block multipliers. Our experiments demonstrate that the transferred per-module hyperparameters yield significant training speed improvements in Large Language Models.
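To make the notion of per-module hyperparameter transfer concrete, the following is a minimal sketch (not the paper's implementation) of a $\mu$P-style setup in PyTorch: each module receives its own optimiser parameter group, and hidden-layer learning rates are rescaled by `base_width / width` so that a base learning rate tuned at a small width can be reused at a larger one. The helper names (`make_mlp`, `per_module_param_groups`, `base_lr`, `base_width`) are hypothetical, and the scaling rule shown covers only the learning rate, not the full parameterisation described above.

```python
# Illustrative sketch, assuming a mu-P-like 1/width learning-rate rule for
# hidden layers; this is not the Complete^(d) Parameterisation itself.
import torch
import torch.nn as nn


def make_mlp(width: int, depth: int, d_in: int = 32, d_out: int = 8) -> nn.Sequential:
    # Simple MLP whose hidden width is the quantity being scaled.
    layers = [nn.Linear(d_in, width), nn.ReLU()]
    for _ in range(depth - 1):
        layers += [nn.Linear(width, width), nn.ReLU()]
    layers.append(nn.Linear(width, d_out))
    return nn.Sequential(*layers)


def per_module_param_groups(model: nn.Sequential, base_lr: float,
                            base_width: int, width: int):
    # One optimiser parameter group per module, so each module carries its
    # own (transferred) learning rate; hidden weights get the 1/width scaling.
    groups = []
    for module in model:
        if not isinstance(module, nn.Linear):
            continue
        is_hidden = module.in_features == width
        lr = base_lr * (base_width / width) if is_hidden else base_lr
        groups.append({"params": module.parameters(), "lr": lr})
    return groups


# Tune base_lr at the small width, then reuse it unchanged at the large width.
small = make_mlp(width=64, depth=4)
large = make_mlp(width=1024, depth=4)
opt_small = torch.optim.AdamW(
    per_module_param_groups(small, base_lr=3e-3, base_width=64, width=64))
opt_large = torch.optim.AdamW(
    per_module_param_groups(large, base_lr=3e-3, base_width=64, width=1024))
```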