Tutel: Adaptive Mixture-of-Experts at Scale

Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, Han Hu, Zilong Wang, Rafael Salas, Jithin Jose, Prabhat Ram, Joe Chau, Peng Cheng, Fan Yang, Mao Yang, Yongqiang Xiong

Sparsely-gated mixture-of-experts (MoE) has been widely adopted to scale deep learning models to trillion-plus parameters at a fixed computational cost. The algorithmic performance of MoE relies on its token routing mechanism, which forwards each input token to the right sub-models, or experts. Because token routing determines each expert's workload dynamically at runtime, existing systems compute inefficiently: their static execution, namely static parallelism and pipelining, does not adapt to the dynamic workload. We present Flex, a highly scalable stack design and implementation for MoE with dynamically adaptive parallelism and pipelining. Flex uses an identical layout for distributing MoE model parameters and input data, which all possible parallelism and pipelining methods can share without mathematical inequivalence or tensor migration overhead. This enables adaptive parallelism and pipelining optimization at zero cost during runtime. Based on this key design, Flex also implements various MoE acceleration techniques. With all techniques combined, Flex speeds up a single MoE layer by 4.96x on 16 A100 GPUs and by 5.75x on 2,048 A100 GPUs over the previous state of the art. Our evaluation shows that Flex efficiently and effectively runs a real-world MoE-based model named SwinV2-MoE, built upon Swin Transformer V2, a state-of-the-art computer vision architecture. On efficiency, Flex accelerates SwinV2-MoE by up to 1.55x in training and 2.11x in inference over Fairseq. On effectiveness, the SwinV2-MoE model achieves higher accuracy than its dense counterpart in both pre-training and downstream computer vision tasks such as COCO object detection, indicating that Flex is ready for end-to-end real-world model training and inference.
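
To make the "dynamic workload" point concrete, the following is a minimal sketch of top-k token routing in plain Python. It is an illustration only, not the Flex implementation or its API: the function names (`route_top_k`, `expert_load`, `make_logits`) are hypothetical, and the random gate logits stand in for the output of a learned gating network, which the abstract does not specify.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_logits, k=1):
    """Return, for each token, the indices of its k highest-scoring experts."""
    assignments = []
    for logits in gate_logits:
        probs = softmax(logits)
        ranked = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)
        assignments.append(ranked[:k])
    return assignments


def expert_load(assignments, num_experts):
    """Count how many tokens each expert receives."""
    load = [0] * num_experts
    for experts in assignments:
        for e in experts:
            load[e] += 1
    return load


def make_logits(num_tokens, num_experts, seed):
    """Assumption: random logits replace a learned gating network."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(num_experts)]
            for _ in range(num_tokens)]


if __name__ == "__main__":
    num_tokens, num_experts = 64, 4
    for step in range(3):
        logits = make_logits(num_tokens, num_experts, seed=step)
        load = expert_load(route_top_k(logits, k=1), num_experts)
        # The per-expert counts differ from step to step.
        print(f"step {step}: tokens per expert = {load}")
```

Running the script prints a different tokens-per-expert distribution for each step, even though the total number of routed tokens is fixed. That step-to-step variation is the runtime workload that a statically chosen parallelism or pipelining strategy cannot track.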
