A Large-Scale Exploration of $\mu$-Transfer

Large neural network models have become a mainstay of natural language processing and computer vision, yet their initialization and learning rates are set in a largely heuristic fashion, potentially varying from paper to paper and one model size to the next. The $\mu$-Parameterization ($\mu$P) offers a potential solution to these challenges, yielding scaling rules for model initialization and learning rates, and reportedly enabling zero-shot hyperparameter transfer from small to large models in a variety of cases. Despite the evident promise, the $\mu$P scaling rules are not yet widely adopted, perhaps due to higher implementation complexity, many variations, or complex theoretical background. This work investigates $\mu$P empirically, focusing on the ubiquitous transformer architecture, and aims to answer a simple question: does $\mu$-Transfer yield optimal learning rates in practice? From models with 2M to 10B parameters, we show that $\mu$-Transfer works as intended for the majority of important cases, but also identify some surprising cases where it may not. Our experiment codebase is available at https://github.com/lucaslingle/mu_transformer/
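The abstract does not spell out the scaling rules themselves. As a rough illustration of what "zero-shot hyperparameter transfer" means in practice, the sketch below uses one commonly cited form of the $\mu$P rule for hidden weight matrices trained with Adam: the learning rate shrinks in proportion to width, and the initialization standard deviation is $1/\sqrt{\text{fan-in}}$. This is an assumption for illustration only, not necessarily the exact variant used in this work; the function name `mup_hidden_hparams` and its parameters are hypothetical, and embedding and output layers follow different rules that are not shown here.

```python
import math


def mup_hidden_hparams(base_lr, base_width, width):
    """Illustrative muP-style scaling for a hidden (width x width) weight matrix.

    Assumed rule (one common variant for Adam, not taken from the abstract):
      - learning rate scales as base_lr * base_width / width
      - init standard deviation is 1 / sqrt(fan_in), with fan_in == width
    """
    lr = base_lr * base_width / width
    init_std = 1.0 / math.sqrt(width)
    return lr, init_std


# Tune base_lr once on a small proxy model, then reuse it at larger widths.
for width in (128, 512, 2048):
    lr, std = mup_hidden_hparams(base_lr=1e-2, base_width=128, width=width)
    print(f"width={width:5d}  lr={lr:.6f}  init_std={std:.6f}")
```

Under this assumed rule, only `base_lr` is searched on the small model; the per-width learning rate for larger models is derived rather than re-tuned, which is the behavior the paper sets out to test empirically.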

