SE-MoE: A Scalable and Efficient Mixture-of-Experts Distributed Training and Inference System

As ML infrastructure grows more diverse, distributed training over heterogeneous computing systems is needed to make large models practical to produce. Mixture-of-Experts (MoE) models were proposed to lower training cost relative to the overall size of the model and data, using gating and parallelism in a divide-and-conquer fashion. While DeepSpeed has made efforts toward large-scale MoE training over heterogeneous infrastructures, the efficiency of training and inference can be further improved in several system aspects, including load balancing, communication/computation efficiency, and memory footprint limits. In this work, we present SE-MoE, which provides Elastic MoE training with 2D prefetch and Fusion communication over Hierarchical storage to make several types of parallelism efficient. For scalable inference on a single node, especially when the model is larger than GPU memory, SE-MoE combines CPU and GPU memory into a ring of sections that hold the model and executes computation tasks across the memory sections in a round-robin manner. We carried out extensive experiments to evaluate SE-MoE, in which it successfully trained a Unified Feature Optimization (UFO) model with a Sparsely-Gated Mixture-of-Experts model of 12B parameters in 8 days on 48 A100 GPU cards. Compared with the state of the art, SE-MoE outperformed DeepSpeed with 33% higher throughput (tokens per second) in training and 13% higher throughput in inference in general. In particular, under unbalanced MoE tasks such as UFO, SE-MoE achieved 64% higher throughput with an 18% lower memory footprint. The code of the framework will be released at: https://github.com/PaddlePaddle/Paddle.
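
The ring-of-sections inference idea in the abstract can be illustrated with a small, CPU-only simulation. This is a minimal sketch under stated assumptions, not SE-MoE's implementation: the function name `ring_inference`, the slot bookkeeping, and the use of plain Python callables as "layers" are all hypothetical. It also runs loads and computation sequentially, whereas a real system would presumably overlap the host-to-device copy with computation.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical stand-in for a model layer: any function from float to float.
Layer = Callable[[float], float]


def ring_inference(
    cpu_layers: List[Layer], num_gpu_slots: int, x: float
) -> Tuple[float, List[Tuple[int, int]]]:
    """Run all layers in order while at most `num_gpu_slots` are 'resident'.

    Layer i always occupies slot i % num_gpu_slots, so the slots are reused
    in round-robin order. Returns the output and a log of (layer, slot) loads.
    """
    if num_gpu_slots < 1:
        raise ValueError("need at least one GPU slot")

    n = len(cpu_layers)
    # slots[s] records which layer index currently sits in ring slot s.
    slots: List[Optional[int]] = [None] * num_gpu_slots
    loads: List[Tuple[int, int]] = []

    def load(layer_idx: int) -> None:
        # Assumption: in a real system this would be a CPU-to-GPU copy.
        slot = layer_idx % num_gpu_slots
        slots[slot] = layer_idx
        loads.append((layer_idx, slot))

    # Warm-up: fill the ring with the first layers.
    for i in range(min(num_gpu_slots, n)):
        load(i)

    for i in range(n):
        slot = i % num_gpu_slots
        if slots[slot] != i:
            raise RuntimeError(f"layer {i} is not resident in slot {slot}")
        x = cpu_layers[i](x)
        # The slot just used is free again: reuse it for the layer that is
        # exactly one full turn of the ring ahead.
        nxt = i + num_gpu_slots
        if nxt < n:
            load(nxt)

    return x, loads


if __name__ == "__main__":
    # Five toy layers, each adding its own index; only two ring slots.
    layers = [lambda v, k=k: v + k for k in range(5)]
    out, log = ring_inference(layers, num_gpu_slots=2, x=0.0)
    print(out)
    print(log)
```

With five layers and two slots, each slot is overwritten as soon as its layer has run, so the whole model never has to be resident at once. The trade-off is that every layer beyond the first ring fill costs one extra load per forward pass.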
