Large neural network models have become a mainstay of natural language processing and computer vision, yet their initialization and learning rates are set in a largely heuristic fashion, potentially varying from paper to paper and from one model size to the next. The $\mu$-Parameterization ($\mu$P) offers a potential solution to these challenges, yielding scaling rules for model initialization and learning rates, and reportedly enabling zero-shot hyperparameter transfer from small to large models in a variety of cases. Despite the evident promise, the $\mu$P scaling rules are not yet widely adopted, perhaps because of higher implementation complexity, many variations, or a complex theoretical background. This work investigates $\mu$P empirically, focusing on the ubiquitous transformer architecture, and aims to answer a simple question: does $\mu$-Transfer yield optimal learning rates in practice? Studying models with up to 10B parameters and training budgets of up to 190B tokens, we find that $\mu$-Transfer works as intended for the majority of important cases, yet we also identify a few cases where it may not. Our experiment codebase is available at https://github.com/lucaslingle/mu_transformer/