Sparsely-gated mixture-of-experts (MoE) has been widely adopted to scale deep learning models to trillion-plus parameters at a fixed computational cost. The algorithmic performance of MoE relies on its token routing mechanism, which forwards each input token to the right sub-models, or experts. Because token routing determines each expert's workload dynamically at runtime, existing systems suffer from inefficient computation: their execution is static, with static parallelism and static pipelining that do not adapt to the dynamic workload. We present Flex, a highly scalable stack design and implementation for MoE with dynamically adaptive parallelism and pipelining. Flex uses a single, identical layout for distributing MoE model parameters and input data. All possible parallelism and pipelining methods can use this layout without introducing any mathematical inequivalence or tensor migration overhead, which enables adaptive parallelism/pipelining optimization at zero cost during runtime. Based on this key design, Flex also implements various MoE acceleration techniques. With all techniques combined, Flex delivers substantial speedups across scales: 4.96x and 5.75x for a single MoE layer on 16 and 2,048 A100 GPUs, respectively, over the previous state of the art. Our evaluation shows that Flex efficiently and effectively runs a real-world MoE-based model named SwinV2-MoE, built upon Swin Transformer V2, a state-of-the-art computer vision architecture. On efficiency, Flex accelerates SwinV2-MoE by up to 1.55x in training and 2.11x in inference over Fairseq. On effectiveness, the SwinV2-MoE model achieves higher accuracy than its dense counterpart in both pre-training and downstream computer vision tasks such as COCO object detection, indicating that Flex is ready for end-to-end real-world model training and inference.
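The dynamic workload that motivates Flex can be illustrated with a small sketch. It is not Flex's implementation: it assumes top-1 routing (each token goes to the single expert with the highest gate score), and it uses random scores as a stand-in for a learned gating network. The function names `top1_route`, `expert_load`, and `random_gate_scores` are hypothetical and chosen only for this illustration.

```python
import random
from collections import Counter


def top1_route(gate_scores):
    """For each token, return the index of the expert with the highest gate score.

    Assumption: top-1 routing; the abstract does not specify the routing rule.
    """
    return [max(range(len(scores)), key=scores.__getitem__) for scores in gate_scores]


def expert_load(assignments, num_experts):
    """Count how many tokens each expert receives."""
    counts = Counter(assignments)
    return [counts.get(e, 0) for e in range(num_experts)]


def random_gate_scores(num_tokens, num_experts, rng):
    """Hypothetical stand-in for a learned gate: one random score per (token, expert)."""
    return [[rng.random() for _ in range(num_experts)] for _ in range(num_tokens)]


if __name__ == "__main__":
    rng = random.Random(0)
    num_tokens, num_experts = 16, 4
    # The per-expert token counts generally differ from batch to batch,
    # which is the runtime variability a static execution plan cannot follow.
    for step in range(3):
        scores = random_gate_scores(num_tokens, num_experts, rng)
        load = expert_load(top1_route(scores), num_experts)
        print(f"batch {step}: tokens per expert = {load}")
```

Because the per-expert counts are only known after routing, a parallelism or pipelining configuration chosen ahead of time may fit one batch well and another poorly, which is the inefficiency the abstract attributes to static execution.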