Scaling large language models (LLMs) significantly improves performance but comes with prohibitive computational costs. Mixture-of-Experts (MoE) models offer an efficient alternative, increasing capacity without a proportional rise in compute requirements. However, training MoE models from scratch poses challenges like overfitting and routing instability. We present an efficient training recipe leveraging pre-trained dense checkpoints, training an 8-Expert Top-2 MoE model from Llama 3-8B with less than $1\%$ of typical pre-training compute. Our approach enhances downstream performance on academic benchmarks, achieving a $\textbf{2\%}$ improvement in 0-shot accuracy on MMLU, while reaching a Model FLOPs Utilization (MFU) of $\textbf{46.8\%}$ during training using our framework. We also integrate online upcycling in NeMo for seamless use of pre-trained weights, enabling cost-effective development of high-capacity MoE models.
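The core idea behind upcycling is to initialize every expert as a copy of the dense model's feed-forward weights, so the MoE starts from the dense model's behavior and a freshly initialized router learns to specialize experts during continued training. A minimal NumPy sketch of this initialization (the function name and shapes are illustrative assumptions, not the NeMo API):

```python
import numpy as np

def upcycle_dense_ffn(w_in, w_out, num_experts=8, seed=0):
    """Upcycle one dense FFN layer into an MoE layer.

    Each expert is an exact copy of the dense FFN's (w_in, w_out) pair,
    so the upcycled layer initially computes the same function as the
    dense layer regardless of routing. The router is freshly initialized
    with small random weights. Hypothetical helper for illustration only.
    """
    experts = [(w_in.copy(), w_out.copy()) for _ in range(num_experts)]
    hidden_dim = w_in.shape[0]
    rng = np.random.default_rng(seed)
    # Small-scale init keeps early routing near-uniform under softmax.
    router = rng.normal(scale=0.02, size=(hidden_dim, num_experts))
    return experts, router

# Toy dense FFN: hidden=16, intermediate=64.
w_in = np.random.randn(16, 64)
w_out = np.random.randn(64, 16)
experts, router = upcycle_dense_ffn(w_in, w_out, num_experts=8)
```

With Top-2 routing, each token is then dispatched to the two experts with the highest router scores, and because all experts start identical, the upcycled model's outputs match the dense model's at step zero.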