Scaling Mixture-of-Experts (MoE) training introduces systems challenges absent in dense models. Each token activates only a subset of experts, so total parameters can grow much faster than per-token computation; this sparsity creates coupled constraints across memory, communication, and computation, where optimizing one dimension often shifts pressure to another and demands co-design across the full system stack. We address these challenges through integrated optimizations spanning memory (fine-grained recomputation, offloading, etc.), communication (optimized dispatchers, overlapping, etc.), and computation (Grouped GEMM, fusions, CUDA Graphs, etc.). The framework also provides Parallel Folding for flexible multi-dimensional parallelism, low-precision training support for FP8 and NVFP4, and efficient long-context training. On NVIDIA GB300 and GB200, it achieves 1,233 and 1,048 TFLOPS/GPU, respectively, for DeepSeek-V3-685B, and 974 and 919 TFLOPS/GPU for Qwen3-235B. As a performant, scalable, and production-ready open-source solution, it has been used across academia and industry to train MoE models ranging from billions to trillions of parameters on clusters of up to thousands of GPUs. This report explains how these techniques work, their trade-offs, and their interactions at the systems level, providing practical guidance for scaling MoE models with Megatron Core.
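To make the sparsity argument concrete, the sketch below (with hypothetical dimensions, not the configuration of any model named above) counts total versus per-token-active parameters for a single top-k MoE FFN layer: memory must hold all experts, while each token's computation touches only `top_k` of them.

```python
# Minimal sketch of MoE sparsity arithmetic: compute scales with top_k,
# memory scales with num_experts. All numbers here are illustrative.

def moe_ffn_params(hidden: int, ffn_hidden: int, num_experts: int, top_k: int):
    """Return (total, active-per-token) parameter counts for one MoE FFN layer."""
    per_expert = 2 * hidden * ffn_hidden   # up- and down-projection weights
    total = num_experts * per_expert       # every expert resides in memory
    active = top_k * per_expert            # but each token routes to only top_k
    return total, active

# Toy configuration: 64 experts, 2 activated per token.
total, active = moe_ffn_params(hidden=4096, ffn_hidden=14336,
                               num_experts=64, top_k=2)
print(f"total: {total / 1e9:.2f}B params, "
      f"active per token: {active / 1e9:.2f}B ({active / total:.1%})")
```

Growing `num_experts` raises the total (and the memory and communication pressure) without changing per-token FLOPs, which is the coupling the report's memory, communication, and computation optimizations target.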