Large neural network models have become a mainstay of natural language processing and computer vision, yet their initialization and learning rates are set in a largely heuristic fashion, potentially varying from paper to paper and from one model size to the next. The $\mu$-Parameterization ($\mu$P) offers a potential solution to these challenges, yielding scaling rules for model initialization and learning rates, and reportedly enabling zero-shot hyperparameter transfer from small to large models in a variety of cases. Despite the evident promise, the $\mu$P scaling rules are not yet widely adopted, perhaps due to higher implementation complexity, many variations, or a complex theoretical background. This work investigates $\mu$P empirically, focusing on the ubiquitous transformer architecture, and aims to answer a simple question: does $\mu$-Transfer yield optimal learning rates in practice? Studying models with 2M to 10B parameters, we show that $\mu$-Transfer works as intended for the majority of important cases, but also identify some surprising cases where it may not.