Scaling Mixture-of-Experts (MoE) training introduces systems challenges absent in dense models. Each token activates only a subset of experts, so total parameters can grow much faster than per-token computation; this sparsity creates coupled constraints across memory, communication, and computation. Optimizing one dimension often shifts pressure to another, demanding co-design across the full system stack. We address these challenges through integrated optimizations spanning memory (fine-grained recomputation, offloading, etc.), communication (optimized dispatchers, overlapping, etc.), and computation (Grouped GEMM, fusions, CUDA Graphs, etc.). The framework also provides Parallel Folding for flexible multi-dimensional parallelism, low-precision training support for FP8 and NVFP4, and efficient long-context training. On NVIDIA GB300 and GB200, it achieves 1,233/1,048 TFLOPS/GPU for DeepSeek-V3-685B and 974/919 TFLOPS/GPU for Qwen3-235B. As a performant, scalable, and production-ready open-source solution, it has been used across academia and industry to train MoE models ranging from billions to trillions of parameters on clusters of up to thousands of GPUs. This report explains how these techniques work, their trade-offs, and their interactions at the systems level, providing practical guidance for scaling MoE models with Megatron Core.