The Mixture of Experts (MoE) model has become an important choice for large language models because it scales with sublinear computational complexity in both training and inference. However, existing MoE models suffer from two critical drawbacks: 1) tremendous intra-node and inter-node communication overhead introduced by all-to-all dispatching and gathering, and 2) limited scalability of the backbone, because data parallelism and expert parallelism are bound together when scaling along the expert dimension. In this paper, we systematically analyze how these drawbacks affect training efficiency from the perspective of the parallel framework, and we propose a novel MoE architecture called Pipeline MoE (PPMoE) to address them. PPMoE combines expert parallelism with tensor parallelism and replaces the communication-intensive all-to-all dispatching and gathering with simple tensor index slicing and an intra-node all-reduce. In addition, its flexible parallel architecture makes it convenient to integrate pipeline parallelism and scale the backbone further. Extensive experiments show that PPMoE not only achieves a speedup of more than $1.75\times$ over existing MoE architectures but also reaches $90\%$ of the throughput of its corresponding backbone model, which is $20\times$ smaller.
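To make the dispatch idea concrete, the following is a minimal single-process sketch of replacing all-to-all dispatch and gather with index slicing followed by an all-reduce. It is an illustration under stated assumptions, not the authors' implementation: it assumes top-1 routing, tokens replicated on every rank of a tensor-parallel group, and round-robin expert placement. All function names (`make_expert`, `local_expert_pass`, `all_reduce_sum`, `ppmoe_layer_sketch`, `reference_layer`) are hypothetical, and the all-reduce is simulated by an in-memory sum rather than a real collective.

```python
def make_expert(scale, shift):
    """Toy stand-in for an expert FFN (assumption): an elementwise affine map."""
    def expert(token):
        return [scale * x + shift for x in token]
    return expert


def local_expert_pass(tokens, assignment, local_experts):
    """Work done by one simulated rank.

    tokens:        list of token vectors, assumed replicated on every rank.
    assignment:    assignment[i] is the expert id chosen for token i (top-1).
    local_experts: dict {expert_id: callable} held by this rank.
    Returns a full-size output that is zero for tokens handled by other ranks.
    """
    dim = len(tokens[0])
    out = [[0.0] * dim for _ in tokens]
    for expert_id, expert in local_experts.items():
        # Index slicing: pick the token positions routed to this local expert.
        idx = [i for i, e in enumerate(assignment) if e == expert_id]
        for i in idx:
            out[i] = expert(tokens[i])
    return out


def all_reduce_sum(partials):
    """Simulated all-reduce: elementwise sum of the per-rank partial outputs."""
    n_tokens, dim = len(partials[0]), len(partials[0][0])
    return [[sum(p[i][j] for p in partials) for j in range(dim)]
            for i in range(n_tokens)]


def ppmoe_layer_sketch(tokens, assignment, experts, n_ranks):
    """Experts are placed round-robin over ranks (assumption); no token is sent
    between ranks before the expert computation, only the outputs are reduced."""
    partials = []
    for rank in range(n_ranks):
        local = {e: f for e, f in experts.items() if e % n_ranks == rank}
        partials.append(local_expert_pass(tokens, assignment, local))
    return all_reduce_sum(partials)


def reference_layer(tokens, assignment, experts):
    """Baseline: apply each token's chosen expert directly, with no parallelism."""
    return [experts[e](t) for t, e in zip(tokens, assignment)]
```

Because each token is processed by exactly one expert under top-1 routing, every output position is nonzero on at most one rank, so summing the partial outputs reproduces the result of the unpartitioned layer. In a real system the sum would be a collective over the tensor-parallel group inside a node.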