μ参数化在混合专家模型中的应用 ($μ$-Parametrization for Mixture of Experts)

Jan Małaśnicki,Kamil Ciebiera,Mateusz Boruń,Maciej Pióro,Jan Ludziejewski,Maciej Stefaniak,Michał Krutul,Sebastian Jaszczur,Marek Cygan,Kamil Adamczewski,Jakub Krajewski

Recent years have seen a growing interest and adoption of LLMs, with Mixture-of-Experts (MoE) emerging as a leading architecture in extremely large models. Currently, the largest open-source models reach over $1$T parameters. At such scales, hyperparameter tuning becomes prohibitively expensive. Precisely for this reason, the $\mu$Transfer is becoming a key technique. It allows for seamless transfer of optimal hyperparameters across model scales, resulting in a huge reduction in tuning costs. However, existing work has primarily focused on dense LLMs, leaving MoE architectures unexplored. In this work, we derive a $\mu$-Parameterization for MoE, providing theoretical guarantees for feature learning across model widths. Our experiments demonstrate that the optimal learning rate reliably transfers across model sizes, establishing a foundation for efficient hyperparameter tuning in large-scale MoE models.

翻译：近年来，大规模语言模型（LLMs）的关注度与采用率持续增长，其中混合专家（MoE）架构已成为超大规模模型的主流设计。目前，最大的开源模型参数量已超过$1$T。在此规模下，超参数调优的成本变得极其高昂。正因如此，$\mu$Transfer正成为一项关键技术，它能够实现最优超参数在不同模型规模间的无缝迁移，从而大幅降低调优开销。然而，现有研究主要集中于稠密LLMs，尚未对MoE架构进行深入探索。本研究推导出适用于MoE的$\mu$参数化方法，为跨模型宽度的特征学习提供了理论保证。实验表明，最优学习率能够在不同模型规模间可靠迁移，这为大规模MoE模型的高效超参数调优奠定了重要基础。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【NeurIPS2021】用于文本图表示学习的 GNN 嵌套 Transformer 模型：GraphFormers

专知会员服务

46+阅读 · 2021年11月24日

Linux导论，Introduction to Linux，96页ppt

专知会员服务

82+阅读 · 2020年7月26日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日