Large neural network models have become a mainstay of natural language processing and computer vision, yet their initialization and learning rates are set in a largely heuristic fashion, potentially varying from paper to paper and from one model size to the next. The $\mu$-Parameterization ($\mu$P) offers a potential solution to these challenges, yielding scaling rules for model initialization and learning rates, and reportedly enabling zero-shot hyperparameter transfer from small to large models in a variety of cases. Despite the evident promise, the $\mu$P scaling rules are not yet widely adopted, perhaps due to higher implementation complexity, many variations, or a complex theoretical background. This work investigates $\mu$P empirically, focusing on the ubiquitous transformer architecture, and aims to answer a simple question: does $\mu$-Transfer yield optimal learning rates in practice? Studying models with 2M to 10B parameters, we show that $\mu$-Transfer works as intended for the majority of important cases, but also identify some surprising cases where it may not. Our experiment codebase is available at https://github.com/lucaslingle/mu_transformer/