MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core

Dennis Liu, Zijie Yan, Xin Yao, Tong Liu, Vijay Korthikanti, Evan Wu, Shiqing Fan, Gao Deng, Hongxiao Bai, Jianbin Chang, Ashwath Aithal, Michael Andersch, Mohammad Shoeybi, Jiajie Yao, Chandler Zhou, David Wu, Xipeng Li, June Yang

Mixture of Experts (MoE) models enhance neural network scalability by dynamically selecting relevant experts per input token, enabling larger model sizes while maintaining manageable computation costs. However, efficient training of large-scale MoE models across thousands of GPUs presents significant challenges due to limitations in existing parallelism strategies. We introduce an end-to-end training framework for large-scale MoE models that utilizes five-dimensional hybrid parallelism: Tensor Parallelism, Expert Parallelism, Context Parallelism, Data Parallelism, and Pipeline Parallelism. Central to our approach is MoE Parallel Folding, a novel strategy that decouples the parallelization of attention and MoE layers in Transformer models, allowing each layer type to adopt optimal parallel configurations. Additionally, we develop a flexible token-level dispatcher that supports both token-dropping and token-dropless MoE training across all five dimensions of parallelism. This dispatcher accommodates dynamic tensor shapes and coordinates different parallelism schemes for attention and MoE layers, facilitating complex parallelism implementations. Our experiments demonstrate significant improvements in training efficiency and scalability. We achieve up to 49.3% Model FLOPs Utilization (MFU) for the Mixtral 8x22B model and 39.0% MFU for the Qwen2-57B-A14B model on H100 GPUs, outperforming existing methods. The framework scales efficiently up to 1,024 GPUs and maintains high performance with sequence lengths up to 128K tokens, validating its effectiveness for large-scale MoE model training. The code is available in Megatron-Core.
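
To make the decoupling idea concrete, the following is a minimal sketch of how one set of ranks could be factored in two different ways, once for attention layers (tensor, context, and data parallelism) and once for MoE layers (expert-tensor, expert, and expert-data parallelism), while sharing the same pipeline stages. This is an illustration only, not the Megatron-Core implementation: the class `FoldedMapping`, the helper functions, and the rank ordering (tensor dimension varying fastest) are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FoldedMapping:
    """Hypothetical container for per-layer-type parallel sizes."""
    world_size: int
    pp: int   # pipeline parallel size, shared by both layer types
    tp: int   # tensor parallel size for attention layers
    cp: int   # context parallel size for attention layers
    etp: int  # tensor parallel size for expert (MoE) layers
    ep: int   # expert parallel size

    def __post_init__(self):
        # Both factorizations must tile the same set of GPUs exactly.
        for name, inner in (("attention", self.tp * self.cp),
                            ("moe", self.etp * self.ep)):
            if self.world_size % (inner * self.pp) != 0:
                raise ValueError(f"{name} sizes do not divide world_size")

    @property
    def dp(self) -> int:
        """Data parallel size implied for attention layers."""
        return self.world_size // (self.tp * self.cp * self.pp)

    @property
    def edp(self) -> int:
        """Data parallel size implied for MoE layers."""
        return self.world_size // (self.etp * self.ep * self.pp)


def attention_coords(m: FoldedMapping, rank: int) -> dict:
    """Coordinates of `rank` in the attention mapping (assumed order: tp fastest)."""
    return {
        "pp": rank // (m.tp * m.cp * m.dp),
        "dp": (rank // (m.tp * m.cp)) % m.dp,
        "cp": (rank // m.tp) % m.cp,
        "tp": rank % m.tp,
    }


def moe_coords(m: FoldedMapping, rank: int) -> dict:
    """Coordinates of the same `rank` in the MoE mapping (assumed order: etp fastest)."""
    return {
        "pp": rank // (m.etp * m.ep * m.edp),
        "edp": (rank // (m.etp * m.ep)) % m.edp,
        "ep": (rank // m.etp) % m.ep,
        "etp": rank % m.etp,
    }


if __name__ == "__main__":
    # 16 GPUs, 2 pipeline stages: attention uses TP2 x CP2 x DP2,
    # while MoE layers use ETP1 x EP8 x EDP1 on the same 8 GPUs per stage.
    mapping = FoldedMapping(world_size=16, pp=2, tp=2, cp=2, etp=1, ep=8)
    for rank in range(mapping.world_size):
        print(rank, attention_coords(mapping, rank), moe_coords(mapping, rank))
```

Because both factorizations cover the same number of ranks per pipeline stage, every rank lands in the same pipeline stage under either mapping in this sketch, which is what lets the attention and MoE layers of one stage pick different group sizes.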
