As machine learning models scale in size and complexity, their computational requirements become a significant barrier to deployment. Mixture-of-Experts (MoE) models alleviate this cost by selectively activating only the relevant experts per token. However, MoE models are still hindered by high communication overhead from all-to-all operations, low GPU utilization due to the synchronous communication constraint, and complications arising in heterogeneous GPU environments. This paper presents Aurora, which jointly optimizes model deployment and all-to-all communication scheduling to address these challenges in MoE inference. Aurora minimizes communication time by strategically ordering token transmissions within all-to-all operations, and improves GPU utilization by colocating experts from different models on the same device, sidestepping the limitations of synchronous all-to-all communication. We analyze Aurora's optimization strategies theoretically across four common GPU cluster settings: exclusive vs. colocated models on GPUs, and homogeneous vs. heterogeneous GPUs. Aurora provides optimal solutions for three of these cases; for the remaining NP-hard scenario, it offers a polynomial-time solution whose communication time is within 1.07x of the optimum. Aurora is the first approach to minimize MoE inference time via optimal model deployment and communication scheduling across these scenarios. Evaluations demonstrate that Aurora significantly accelerates inference, achieving speedups of up to 2.38x in homogeneous clusters and 3.54x in heterogeneous environments. Moreover, Aurora improves GPU utilization by up to 1.5x compared to existing methods.
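The abstract's claim about "strategically ordering token transmissions" can be illustrated with a minimal sketch. This is not Aurora's actual algorithm (which the abstract does not specify); it is a hypothetical longest-transfer-first list schedule under a simplified model where each GPU sends to one peer and receives from one peer at a time, showing how transmission order affects all-to-all completion time. The function name and the transfer-matrix representation are assumptions for illustration only.

```python
def schedule_all_to_all(transfer, n):
    """Illustrative longest-first ordering of all-to-all transfers.

    transfer[i][j] = time to ship GPU i's tokens to GPU j (0 = nothing
    to send). Simplified model: each GPU sends to at most one peer and
    receives from at most one peer at any moment. Transfers are started
    in decreasing order of duration, each beginning as soon as both its
    sender and receiver are free. Returns (makespan, schedule) where
    schedule is a list of (start, end, src, dst) tuples.
    """
    # Sort all nonzero transfers by duration, longest first.
    pending = sorted(
        ((transfer[i][j], i, j)
         for i in range(n) for j in range(n)
         if i != j and transfer[i][j] > 0),
        reverse=True)
    send_free = [0.0] * n  # earliest time GPU i's send port is idle
    recv_free = [0.0] * n  # earliest time GPU j's receive port is idle
    schedule = []
    for dur, i, j in pending:
        start = max(send_free[i], recv_free[j])
        end = start + dur
        send_free[i] = end
        recv_free[j] = end
        schedule.append((start, end, i, j))
    makespan = max(end for _, end, _, _ in schedule)
    return makespan, schedule
```

In this toy model, transfers on disjoint sender/receiver pairs overlap freely, so the finish time is driven by the busiest port; prioritizing long transfers keeps them from being serialized behind short ones at a contended receiver.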