A Novel Tensor-Expert Hybrid Parallelism Approach to Scale Mixture-of-Experts Training

Mixture-of-Experts (MoE) is a recently proposed neural network architecture that increases the parameter count of a neural network (the base model) by adding sparsely activated expert blocks, without changing the total number of floating point operations needed for training or inference. In theory, this architecture allows us to train arbitrarily large models while keeping the computational cost the same as that of the base model. However, prior work has observed diminishing returns in the test accuracy of MoE models beyond 64 to 128 expert blocks. Training high-quality MoE models therefore requires scaling the size of the base model along with the number of expert blocks. In this work, we propose a novel, three-dimensional, hybrid parallel algorithm that combines tensor, expert, and data parallelism to enable the training of MoE models with 4-8x larger base models than the current state of the art, DeepSpeed-MoE. We also propose memory optimizations in the optimizer step, and communication optimizations that eliminate redundant movement of data. Removing these redundancies provides a speedup of nearly 21%. When training a 40 billion parameter MoE model (a 6.7 billion parameter base model with 16 experts) on 128 V100 GPUs, our optimizations improve hardware utilization from 20% to 27% of peak half-precision flop/s.
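
To make the sparse-activation argument concrete, the following is a minimal sketch in pure Python. It assumes top-1 routing and a two-layer ReLU feed-forward expert; the abstract does not specify the routing scheme or the expert shape, and the names `moe_forward`, `expert_param_count`, and `expert_flops_per_token` are illustrative only, not taken from the paper or from DeepSpeed-MoE. The point it tries to show is that expert parameters grow linearly with the number of experts, while the expert work done per token does not.

```python
import math
import random


def matvec(w, v):
    """Multiply matrix w (a list of rows) by vector v."""
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]


def moe_forward(token, gate_w, experts):
    """Route one token to a single expert (top-1, an assumption) and apply it.

    token:   list of d_model floats
    gate_w:  num_experts rows, each of length d_model (router weights)
    experts: list of (w1, w2); w1 has d_ff rows of length d_model,
             w2 has d_model rows of length d_ff
    """
    logits = matvec(gate_w, token)
    chosen = max(range(len(logits)), key=logits.__getitem__)

    # Softmax probability of the chosen expert, used to scale its output.
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    gate_prob = exps[chosen] / sum(exps)

    # Only the chosen expert's weights are touched for this token.
    w1, w2 = experts[chosen]
    hidden = [max(h, 0.0) for h in matvec(w1, token)]  # ReLU
    out = [gate_prob * o for o in matvec(w2, hidden)]
    return out, chosen


def expert_param_count(num_experts, d_model, d_ff):
    """Expert weights only: two matrices per expert, biases ignored."""
    return num_experts * 2 * d_model * d_ff


def expert_flops_per_token(d_model, d_ff):
    """Two mat-vec products at 2 flops per multiply-add.

    Independent of the number of experts because one expert runs per token.
    The (small) router cost is ignored here.
    """
    return 2 * 2 * d_model * d_ff


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    d_model, d_ff = 4, 8
    for num_experts in (1, 16, 64):
        gate_w = random_matrix(num_experts, d_model, rng)
        experts = [
            (random_matrix(d_ff, d_model, rng), random_matrix(d_model, d_ff, rng))
            for _ in range(num_experts)
        ]
        token = [rng.uniform(-1.0, 1.0) for _ in range(d_model)]
        out, chosen = moe_forward(token, gate_w, experts)
        print(
            num_experts,
            expert_param_count(num_experts, d_model, d_ff),
            expert_flops_per_token(d_model, d_ff),
        )
```

In this sketch the parameter count scales with `num_experts` while `expert_flops_per_token` stays fixed, which is the property the abstract relies on. It says nothing about how experts are placed across GPUs; the tensor, expert, and data parallel decomposition proposed in the paper is not modelled here.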
