Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference

In large language models such as the Generative Pre-trained Transformer (GPT), the Mixture of Experts (MoE) paradigm has emerged as a powerful technique for enhancing model expressiveness and accuracy. However, deploying GPT MoE models for parallel inference on distributed systems presents significant challenges, primarily because of the extensive Alltoall communication required for expert routing and aggregation. This communication bottleneck compounds an already heavy computational load and hinders efficient use of high-performance computing resources. In this paper, we propose ExFlow, a lightweight optimization technique that substantially accelerates inference for these MoE models. We take a new perspective on alleviating communication overhead by exploiting inter-layer expert affinity. Unlike previous methods, our solution can be applied directly to pre-trained MoE models without any fine-tuning or accuracy degradation. By proposing context-coherent expert parallelism on distributed systems, our design uses only one Alltoall communication to deliver the same functionality, whereas previous methods all require two. By carefully examining the conditional probability of tokens' routing across multiple layers, we show that pre-trained GPT MoE models implicitly exhibit strong inter-layer expert affinity. We then design an efficient integer programming model to capture this affinity and show that, by placing experts on the appropriate GPUs, we can reduce cross-GPU routing latency by up to 67%. Our solution outperforms cutting-edge MoE implementations with 8 to 64 experts, improving inference throughput by up to 2.2x. We further provide a detailed study of how the model implicitly acquires this expert affinity at a very early training stage and how the affinity evolves and stabilizes during training.
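The idea of measuring inter-layer expert affinity and using it to place experts can be illustrated with a small sketch. This is not ExFlow's implementation: the function names, the toy routing traces, and the two-GPU setup are all assumptions made for illustration, and an exhaustive search over balanced placements stands in for the integer programming model mentioned in the abstract. The sketch estimates the conditional probability that a token routed to expert i at one layer goes to expert j at the next, then picks the next-layer placement that minimizes cross-GPU transitions.

```python
from collections import Counter
from itertools import combinations


def transition_counts(routes):
    """Count tokens going from expert i (layer L) to expert j (layer L+1)."""
    return Counter(routes)


def conditional_affinity(counts):
    """Estimate P(expert j at layer L+1 | expert i at layer L) from counts."""
    totals = Counter()
    for (i, _), c in counts.items():
        totals[i] += c
    return {(i, j): c / totals[i] for (i, j), c in counts.items()}


def cross_gpu_transitions(counts, prev_placement, next_placement):
    """Number of token transitions whose two experts sit on different GPUs."""
    return sum(
        c for (i, j), c in counts.items()
        if prev_placement[i] != next_placement[j]
    )


def best_two_gpu_placement(counts, prev_placement, num_experts):
    """Exhaustive search over balanced 2-GPU placements (tiny sizes only).

    Stand-in for an integer program: feasible here because 4 experts on
    2 GPUs gives only 6 candidate placements.
    """
    best, best_cost = None, None
    for on_gpu0 in combinations(range(num_experts), num_experts // 2):
        candidate = [0 if e in on_gpu0 else 1 for e in range(num_experts)]
        cost = cross_gpu_transitions(counts, prev_placement, candidate)
        if best_cost is None or cost < best_cost:
            best, best_cost = candidate, cost
    return best, best_cost


# Hypothetical routing traces: (expert at layer L, expert at layer L+1) per token.
routes = [(0, 2), (0, 2), (0, 2), (1, 3), (1, 3),
          (2, 0), (2, 0), (3, 1), (3, 1), (0, 1)]

counts = transition_counts(routes)
affinity = conditional_affinity(counts)

# Layer L placement is fixed: experts 0, 1 on GPU 0 and experts 2, 3 on GPU 1.
prev_placement = [0, 0, 1, 1]

# Baseline: place layer L+1 experts by index, ignoring affinity.
naive_cost = cross_gpu_transitions(counts, prev_placement, [0, 0, 1, 1])

# Affinity-aware placement for layer L+1.
best, best_cost = best_two_gpu_placement(counts, prev_placement, num_experts=4)

print("P(expert 2 next | expert 0 now) =", affinity[(0, 2)])
print("naive cross-GPU transitions:", naive_cost)
print("best placement:", best, "cross-GPU transitions:", best_cost)
```

In this toy trace, tokens leaving expert 0 mostly go to expert 2, so putting layer L+1's expert 2 on the same GPU as layer L's expert 0 removes most cross-GPU hops. Real models have many more experts and layers, which is where a proper integer programming formulation, rather than enumeration, becomes necessary.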

