With the increasing diversity of ML infrastructures, distributed training over heterogeneous computing systems is needed to facilitate the production of big models. Mixture-of-Experts (MoE) models have been proposed to lower the cost of training relative to the overall size of the model and data, using gating and parallelism in a divide-and-conquer fashion. While DeepSpeed has made efforts to carry out large-scale MoE training over heterogeneous infrastructures, the efficiency of training and inference can be further improved in several system aspects, including load balancing, communication/computation efficiency, and memory footprint limits. In this work, we present SE-MoE, which provides Elastic MoE training with 2D prefetch and Fusion communication over Hierarchical storage, so as to make efficient use of several types of parallelism. For scalable inference on a single node, especially when the model is larger than GPU memory, SE-MoE combines CPU and GPU memory into a ring of sections to load the model, and executes the computation tasks across the memory sections in a round-robin manner. We carried out extensive experiments to evaluate SE-MoE, in which SE-MoE successfully trained a Unified Feature Optimization (UFO) model with a Sparsely-Gated Mixture-of-Experts model of 12B parameters in 8 days on 48 A100 GPU cards. Compared with the state of the art, SE-MoE generally outperformed DeepSpeed, with 33% higher throughput (tokens per second) in training and 13% higher throughput in inference. In particular, under unbalanced MoE tasks, e.g., UFO, SE-MoE achieved 64% higher throughput with an 18% lower memory footprint. The code of the framework will be released at: https://github.com/PaddlePaddle/Paddle.
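The ring-of-sections inference idea described in the abstract can be illustrated with a minimal sketch. This is not SE-MoE's implementation: the function name `ring_inference`, the `num_slots` parameter, and the use of plain Python callables as stand-ins for model sections are all assumptions made for illustration. The sketch only shows the round-robin slot reuse; a real system would presumably overlap the copy into a freed slot with computation on the other slots.

```python
def ring_inference(layers, x, num_slots=2):
    """Toy round-robin execution over a ring of memory slots.

    `layers` is a list of callables standing in for model sections kept in
    CPU memory; `slots` stands in for the limited GPU-resident sections.
    All names here are hypothetical and not taken from SE-MoE.
    """
    if num_slots < 1:
        raise ValueError("num_slots must be >= 1")

    slots = [None] * num_slots
    loads = 0  # number of simulated CPU-to-GPU section loads

    # Fill the ring with the first sections before any computation starts.
    for i in range(min(num_slots, len(layers))):
        slots[i] = layers[i]
        loads += 1

    for i in range(len(layers)):
        slot = i % num_slots      # section i always lives in slot i mod num_slots
        x = slots[slot](x)        # compute on the resident section

        # The slot is now free: reuse it for the section that will need it next.
        # (Assumption: a real system would issue this as an asynchronous copy.)
        nxt = i + num_slots
        if nxt < len(layers):
            slots[slot] = layers[nxt]
            loads += 1

    return x, loads


# Usage: five "sections", each adding its index, with only two resident slots.
sections = [lambda v, k=k: v + k for k in range(5)]
result, loads = ring_inference(sections, 0, num_slots=2)
```

With two slots, at most two sections are resident at any time, yet all five are executed in order, which is the property that lets a model larger than the resident memory run on a single node.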