Sparse Mixture of Experts (SMoE) performs conditional computation by selectively activating a subset of experts, enabling scalable parameter growth in large language models (LLMs). However, the expanded parameter scale exceeds the memory capacity of a single device, necessitating distributed deployment for inference. This setup introduces two critical challenges: (1) Communication: transferring token features to the devices hosting the activated experts incurs significant communication overhead. (2) Computational load: skewed expert activation overloads certain GPUs, causing load imbalance across devices. Of the two, communication overhead is the main bottleneck in SMoE inference; yet reducing cross-device communication can exacerbate computational load imbalance, leaving devices idle and wasting resources. We therefore present GRACE-MoE, short for Grouping and Replication with Locality-Aware Routing, for SMoE inference. GRACE-MoE is a co-optimization framework that jointly reduces communication overhead and alleviates computational load imbalance. The framework comprises two key phases: (1) Grouping & Replication: experts are grouped by their affinity to reduce cross-device communication, and dynamic replication is applied to address load skew, improving computational load balance across GPUs. (2) Routing: a locality-aware routing strategy with load prediction prioritizes local replicas to minimize communication overhead and balances requests across remote replicas when necessary. Experiments with diverse models in multi-node, multi-GPU environments demonstrate that GRACE-MoE effectively reduces end-to-end inference latency, achieving up to 3.79x speedup over state-of-the-art systems. Code for GRACE-MoE will be released upon acceptance.
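To make the Grouping & Replication phase concrete, the following is a minimal sketch, not the paper's exact algorithm: it assumes a symmetric expert co-activation matrix as the affinity signal, greedily packs high-affinity experts onto the same GPU, and then spends a hypothetical replica `budget` on the experts with the highest per-replica load. The names `group_experts` and `replicate_hot_experts` and the greedy heuristics are illustrative assumptions.

```python
import numpy as np

def group_experts(affinity: np.ndarray, num_gpus: int) -> list[list[int]]:
    """Greedily assign experts to GPUs so that pairs with high
    co-activation affinity land on the same device (a sketch; the
    paper's grouping objective may differ)."""
    num_experts = affinity.shape[0]
    per_gpu = num_experts // num_gpus  # assume even divisibility for the sketch
    unassigned = set(range(num_experts))
    groups = []
    for _ in range(num_gpus):
        # seed each group with the hottest remaining expert
        seed = max(unassigned, key=lambda e: affinity[e].sum())
        group = [seed]
        unassigned.remove(seed)
        while len(group) < per_gpu and unassigned:
            # add the expert with the strongest total affinity to the group
            best = max(unassigned, key=lambda e: sum(affinity[e][g] for g in group))
            group.append(best)
            unassigned.remove(best)
        groups.append(group)
    return groups

def replicate_hot_experts(groups, expert_load, budget):
    """Spend `budget` replicas on the experts whose per-replica load is
    highest, placing each new replica on the least-loaded eligible GPU."""
    placement = {e: {g} for g, grp in enumerate(groups) for e in grp}

    def gpu_load(g):
        # a replica's load is assumed to split evenly across its copies
        return sum(expert_load[e] / len(gpus)
                   for e, gpus in placement.items() if g in gpus)

    for _ in range(budget):
        hot = max(placement, key=lambda e: expert_load[e] / len(placement[e]))
        candidates = [g for g in range(len(groups)) if g not in placement[hot]]
        if not candidates:
            break
        placement[hot].add(min(candidates, key=gpu_load))
    return placement

# usage with synthetic affinity/load statistics
rng = np.random.default_rng(0)
aff = rng.random((16, 16))
aff = (aff + aff.T) / 2                       # symmetric co-activation counts
groups = group_experts(aff, num_gpus=4)
load = {e: float(aff[e].sum()) for e in range(16)}
placement = replicate_hot_experts(groups, load, budget=4)
```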
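The Routing phase can be sketched similarly. The snippet below reuses `placement` from the previous sketch and makes two illustrative assumptions beyond the abstract: an exponential-moving-average (`alpha`) load predictor per GPU, and a least-predicted-load rule for choosing among remote replicas; the class and method names are hypothetical.

```python
class LocalityAwareRouter:
    """A sketch of locality-aware routing with load prediction: prefer a
    replica on the token's local GPU, otherwise dispatch to the remote
    replica with the lowest predicted load."""

    def __init__(self, placement, num_gpus, alpha=0.9):
        self.placement = placement          # expert id -> set of GPU ids hosting a replica
        self.pred_load = [0.0] * num_gpus   # per-GPU predicted load (EMA of dispatches)
        self.alpha = alpha

    def route(self, expert: int, local_gpu: int) -> int:
        replicas = self.placement[expert]
        if local_gpu in replicas:
            target = local_gpu              # local replica: no cross-device transfer
        else:
            # fall back to the remote replica with the lowest predicted load
            target = min(replicas, key=lambda g: self.pred_load[g])
        # update the exponential-moving-average load estimates
        for g in range(len(self.pred_load)):
            self.pred_load[g] = self.alpha * self.pred_load[g] + (1 - self.alpha) * (g == target)
        return target

router = LocalityAwareRouter(placement, num_gpus=4)
gpu = router.route(expert=3, local_gpu=1)   # GPU that serves expert 3 for this token
```

The local-first rule captures the abstract's priority ordering: a local hit eliminates the feature transfer entirely, while the least-loaded fallback spreads skewed traffic across remote replicas instead of piling it onto one GPU.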