A recently proposed neural network architecture, Mixture-of-Experts (MoE), increases the parameter count of a neural network (the base model) by adding sparsely activated expert blocks, without changing the total number of floating point operations required for training or inference. In theory, this architecture allows us to train arbitrarily large models while keeping the computational cost the same as that of the base model. However, prior work has observed diminishing returns in the test accuracy of MoE models beyond 64 to 128 expert blocks. Training high-quality MoE models therefore requires scaling the size of the base model along with the number of expert blocks. In this work, we propose a novel, three-dimensional, hybrid parallel algorithm that combines tensor, expert, and data parallelism to enable the training of MoE models with 4-8x larger base models than the current state of the art, DeepSpeed-MoE. We also propose memory optimizations in the optimizer step, and communication optimizations that eliminate redundant movement of data. Removing these redundancies provides a speedup of nearly 21%. When training a 40 billion parameter MoE model (a 6.7 billion parameter base model with 16 experts) on 128 V100 GPUs, our optimizations raise the achieved fraction of peak half-precision flop/s from 20% to 27%.
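The relationship between the 6.7 billion parameter base model, the 16 experts, and the roughly 40 billion total parameters can be checked with a back-of-the-envelope count. The sketch below is illustrative only and rests on assumptions that the abstract does not state: a GPT-style base model with 32 layers and hidden size 4096, a feed-forward block with 4x expansion, expert blocks that replace the feed-forward block in every other layer, and top-1 routing. Embeddings, biases, and layer norms are ignored, so the dense count comes out slightly below 6.7 billion. All function and parameter names are hypothetical.

```python
def dense_params(num_layers: int, hidden: int) -> int:
    """Approximate dense transformer size: per layer, 4*h^2 for attention
    plus 8*h^2 for the feed-forward block (assumed 4x expansion)."""
    return 12 * num_layers * hidden * hidden


def moe_params(num_layers: int, hidden: int, num_experts: int,
               moe_every: int = 2) -> int:
    """Approximate MoE size, assuming every `moe_every`-th layer replaces
    its single feed-forward block with `num_experts` expert copies."""
    ffn = 8 * hidden * hidden                # parameters in one feed-forward block
    moe_layers = num_layers // moe_every     # layers that carry expert blocks
    extra = moe_layers * (num_experts - 1) * ffn
    return dense_params(num_layers, hidden) + extra


# Assumed configuration (not taken from the abstract).
LAYERS, HIDDEN, EXPERTS = 32, 4096, 16

base = dense_params(LAYERS, HIDDEN)
total = moe_params(LAYERS, HIDDEN, EXPERTS)

# With top-1 routing (assumed), each token passes through exactly one expert
# per MoE layer, so the parameters touched per token match the base model.
active_per_token = base

if __name__ == "__main__":
    print(f"base model : {base / 1e9:.2f}B parameters")
    print(f"MoE model  : {total / 1e9:.2f}B parameters")
    print(f"active/tok : {active_per_token / 1e9:.2f}B parameters")
```

Under these assumptions the total is about 38.7 billion parameters, in line with the "40 billion" figure, while the parameters exercised per token stay at the size of the base model. This is why the flop count is unchanged even though the model is roughly six times larger.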