In large language models such as the Generative Pre-trained Transformer (GPT), the Mixture of Experts (MoE) paradigm has emerged as a powerful technique for enhancing model expressiveness and accuracy. However, deploying GPT MoE models for parallel inference on distributed systems presents significant challenges, primarily because of the extensive Alltoall communication required for expert routing and aggregation. This communication bottleneck compounds an already complex computational landscape and hinders the efficient use of high-performance computing resources. In this paper, we propose ExFlow, a lightweight optimization technique that substantially accelerates inference for these MoE models. We take a new perspective on alleviating the communication overhead by exploiting inter-layer expert affinity. Unlike previous methods, our solution can be applied directly to pre-trained MoE models without any fine-tuning or accuracy degradation. By proposing context-coherent expert parallelism on distributed systems, our design uses only one Alltoall communication to deliver the same functionality, whereas previous methods all require two. By carefully examining the conditional probability of tokens' routing across multiple layers, we demonstrate that pre-trained GPT MoE models implicitly exhibit strong inter-layer expert affinity. We then design an efficient integer programming model to capture these features and show that, by properly placing the experts on the corresponding GPUs, we can reduce cross-GPU routing latency by up to 67%. Our solution outperforms cutting-edge MoE implementations on models with 8 to 64 experts, with up to 2.2x improvement in inference throughput. We further provide a detailed study of how the model implicitly acquires this expert affinity at a very early training stage and how the affinity evolves and stabilizes during training.
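To make the notion of inter-layer expert affinity concrete, the following is a minimal sketch, not the ExFlow implementation. It assumes routing traces are available as one list of expert indices per token (one entry per MoE layer) and estimates the conditional probability that a token routed to expert i at one layer goes to expert j at the next. It then measures how often a token would have to change GPUs under a given expert placement. The function names `affinity` and `cross_gpu_fraction`, the trace format, and the toy data are all hypothetical. The placements are chosen by hand here, whereas the paper obtains them from an integer programming model.

```python
from collections import Counter


def affinity(routes, layer):
    """Estimate P(expert j at layer+1 | expert i at layer) from token routes.

    routes: list of per-token expert-index lists, one entry per MoE layer
    (an assumed trace format, used only for this illustration).
    """
    pair_counts = Counter((r[layer], r[layer + 1]) for r in routes)
    src_counts = Counter(r[layer] for r in routes)
    return {(i, j): c / src_counts[i] for (i, j), c in pair_counts.items()}


def cross_gpu_fraction(routes, placement):
    """Fraction of layer-to-layer hops that move a token to another GPU.

    placement[l][e] is the GPU that holds expert e of layer l.
    """
    hops = crossings = 0
    for r in routes:
        for l in range(len(r) - 1):
            hops += 1
            if placement[l][r[l]] != placement[l + 1][r[l + 1]]:
                crossings += 1
    return crossings / hops if hops else 0.0


# Toy trace (made-up data): 5 tokens, 2 MoE layers, 2 experts per layer.
routes = [[0, 1], [0, 1], [0, 1], [0, 0], [1, 0]]
aff = affinity(routes, 0)

# Hand-picked placements on 2 GPUs; a solver would search for these instead.
naive = [[0, 1], [0, 1]]    # expert e of every layer lives on GPU e
aligned = [[0, 1], [1, 0]]  # layer-1 experts swapped to follow the affinity

naive_frac = cross_gpu_fraction(routes, naive)
aligned_frac = cross_gpu_fraction(routes, aligned)
```

On this toy trace, three of the four tokens that visit expert 0 in the first layer go to expert 1 in the second, so co-locating those two experts on one GPU removes most cross-GPU hops. Choosing such a placement automatically, at scale, is the role the abstract assigns to the integer programming model.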