A Large-Scale Exploration of $\mu$-Transfer

Large neural network models have become a mainstay of natural language processing and computer vision, yet their initialization and learning rates are set in a largely heuristic fashion, potentially varying from paper to paper and one model size to the next. The $\mu$-Parameterization ($\mu$P) offers a potential solution to these challenges, yielding scaling rules for model initialization and learning rates, and reportedly enabling zero-shot hyperparameter transfer from small to large models in a variety of cases. Despite the evident promise, the $\mu$P scaling rules are not yet widely adopted, perhaps due to higher implementation complexity, many variations, or complex theoretical background. This work investigates $\mu$P empirically, focusing on the ubiquitous transformer architecture, and aims to answer a simple question: does $\mu$-Transfer yield optimal learning rates in practice? Studying models with up to 10B parameters and training budgets of up to 190B tokens, we find $\mu$-Transfer works as intended for the majority of important cases, yet also identify a few cases where it may not. Our experiment codebase is available at https://github.com/lucaslingle/mu_transformer/
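To make the idea of width-dependent scaling rules concrete, here is a minimal sketch of how a $\mu$P-style rule might set the initialization and learning rate of a single hidden weight matrix. The abstract does not state the rules themselves, so the specific forms used here (initialization variance proportional to 1/width, and an Adam learning rate scaled by base_width/width) are assumptions taken from common descriptions of $\mu$P. The helper name `mup_hidden_hparams` and its parameters are hypothetical and are not taken from the paper's codebase.

```python
import math


def mup_hidden_hparams(width, base_width, base_lr):
    """Hypothetical helper: muP-style settings for one hidden weight matrix.

    Assumed rules (not stated in the abstract):
      * init std = 1 / sqrt(width), i.e. variance proportional to 1 / fan_in
      * Adam learning rate = base_lr * base_width / width
    """
    if width <= 0 or base_width <= 0:
        raise ValueError("width and base_width must be positive")
    init_std = 1.0 / math.sqrt(width)
    lr = base_lr * base_width / width
    return init_std, lr


if __name__ == "__main__":
    # Tune base_lr once at the small base width, then reuse it at larger widths.
    for width in (128, 512, 2048):
        std, lr = mup_hidden_hparams(width, base_width=128, base_lr=1e-2)
        print(f"width={width}  init_std={std:.5f}  lr={lr:.6f}")
```

Under these assumptions, the learning rate tuned on the small proxy model is the only value searched; the larger models derive theirs from the width ratio. Whether the rates derived this way remain near-optimal at scale is the question the paper examines empirically.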

