Mixture-of-Experts (MoE) is a neural network architecture that adds sparsely activated expert blocks to a base model, increasing the number of parameters without impacting computational costs. However, current distributed deep learning frameworks are limited in their ability to train high-quality MoE models with large base models. In this work, we present DeepSpeed-TED, a novel, three-dimensional, hybrid parallel algorithm that combines data, tensor, and expert parallelism to enable the training of MoE models with 4 to 8x larger base models than the current state of the art. We also describe memory optimizations in the optimizer step, and communication optimizations that eliminate unnecessary data movement. We implement our approach in DeepSpeed and achieve a speedup of 26% over a baseline (i.e., without our communication optimizations) when training a 40 billion parameter MoE model (a 6.7 billion parameter base model with 16 experts) on 128 V100 GPUs.
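To make the three-dimensional decomposition concrete, the following is a minimal sketch of how 128 GPUs might be factored into tensor, expert, and data parallel dimensions. It is an illustration under stated assumptions, not the DeepSpeed implementation: the class `TedLayout`, its fields, the rank ordering in `coords`, and the tensor parallel degree of 4 are all hypothetical. The sketch assumes that non-expert blocks use tensor and data parallelism only, while expert blocks additionally split the data parallel dimension by the expert parallel degree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TedLayout:
    """Hypothetical 3D layout: tensor x expert x data parallelism."""
    world_size: int       # total number of GPUs
    tensor_parallel: int  # GPUs that share one sharded layer
    expert_parallel: int  # GPUs across which experts are distributed

    def __post_init__(self):
        if min(self.world_size, self.tensor_parallel, self.expert_parallel) < 1:
            raise ValueError("all degrees must be positive")
        if self.world_size % (self.tensor_parallel * self.expert_parallel) != 0:
            raise ValueError("world_size must be divisible by tensor * expert degree")

    @property
    def data_parallel_non_expert(self) -> int:
        # Assumption: non-expert blocks see only tensor and data parallelism.
        return self.world_size // self.tensor_parallel

    @property
    def data_parallel_expert(self) -> int:
        # Assumption: expert blocks see tensor, expert, and data parallelism.
        return self.world_size // (self.tensor_parallel * self.expert_parallel)

    def coords(self, rank: int) -> tuple[int, int, int]:
        """Map a global rank to (tensor, expert, expert-data) indices.

        The ordering (tensor index varies fastest) is an illustrative choice.
        """
        if not 0 <= rank < self.world_size:
            raise ValueError("rank out of range")
        t = rank % self.tensor_parallel
        e = (rank // self.tensor_parallel) % self.expert_parallel
        d = rank // (self.tensor_parallel * self.expert_parallel)
        return t, e, d


# 128 GPUs and 16 experts come from the text; tensor_parallel=4 is assumed.
layout = TedLayout(world_size=128, tensor_parallel=4, expert_parallel=16)
print(layout.data_parallel_non_expert, layout.data_parallel_expert)
print(layout.coords(5))
```

Under these assumptions, each non-expert block is replicated across 32 data parallel groups, while each expert is replicated only twice, which is why the two kinds of blocks need separate data parallel groups.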