Mixture-of-Experts (MoE) architectures have emerged as a promising solution for scaling transformer models efficiently: sparse activation increases model capacity while reducing computational cost. However, as MoE models scale, their large memory footprint forces them to be distributed across GPU devices, where they face critical performance bottlenecks. Expert parallelism distributes experts across GPUs but faces key challenges, including unbalanced token routing and expert activation, which cause communication tail latency and processing inefficiencies. While existing solutions address some of these issues, they fail to resolve the dual challenges of load imbalance and communication skew: imbalanced token processing load across experts causes uneven processing times on different GPUs, while communication skew between GPUs leads to unbalanced inter-GPU data transfers. Both factors degrade MoE performance by increasing tail latency and reducing overall throughput. To address these limitations, we propose an Integer Linear Programming (ILP) formulation that optimizes expert placement by jointly considering token load, communication, and computation costs. We exploit the token routing dependency across layers: tokens routed to a specific expert in one layer are likely to be routed to a limited set of experts in the subsequent layer. Our solution, MoETuner, produces an optimal expert-to-GPU assignment that minimizes inter-GPU token routing cost and balances token processing across devices, thereby reducing tail latency and end-to-end execution time. Experimental results demonstrate end-to-end speedups of 9.3% and 17.5% for single-node and multi-node inference, respectively, showcasing the potential of our ILP-based optimization for providing expert-parallel solutions for next-generation MoEs.
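The joint objective described above (balance per-GPU token load while minimizing cross-GPU routing traffic between consecutive layers) can be illustrated with a minimal toy sketch. This is not the paper's ILP solver: the instance sizes, cost weights, and the brute-force search below are hypothetical stand-ins used only to show the shape of the objective; a real deployment would solve the ILP with an optimization library over profiled routing statistics.

```python
from itertools import product

# Hypothetical toy instance: 4 experts, 2 GPUs (numbers are illustrative only).
# load[e] = tokens routed to expert e in a layer.
load = [30, 10, 25, 15]
# traffic[i][j] = tokens routed to expert i in layer L and expert j in layer L+1,
# capturing the cross-layer routing dependency the abstract describes.
traffic = [
    [0, 20, 2, 1],
    [20, 0, 1, 3],
    [2, 1, 0, 18],
    [1, 3, 18, 0],
]
GPUS = 2

def cost(assign, alpha=1.0, beta=1.0):
    """Joint cost of an expert-to-GPU assignment (tuple: expert -> GPU id)."""
    # Load-balance term: the most loaded GPU determines the tail latency.
    per_gpu = [0] * GPUS
    for e, g in enumerate(assign):
        per_gpu[g] += load[e]
    # Communication term: tokens that must cross a GPU boundary between layers.
    n = len(assign)
    cross = sum(traffic[i][j]
                for i in range(n) for j in range(n)
                if assign[i] != assign[j])
    return alpha * max(per_gpu) + beta * cross

# Brute force is feasible here (2^4 assignments); the paper's ILP handles real scales.
best = min(product(range(GPUS), repeat=len(load)), key=cost)
print(best, cost(best))  # co-locates the strongly coupled pairs {0,1} and {2,3}
```

In this toy instance the optimizer groups the expert pairs with heavy mutual traffic onto the same GPU, which also happens to balance the token load (40 tokens per GPU), mirroring the trade-off MoETuner's ILP formulation navigates at scale.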