Sparse Mixture of Experts (SMoE) enables scalable parameter growth in large language models (LLMs) by selectively activating a subset of experts, but its large parameter count necessitates distributed deployment for inference. Distributed inference, however, faces a critical dilemma: communication overhead is the primary bottleneck, yet reducing it often exacerbates computational load imbalance and wastes resources. In this paper, we present GRACE-MoE (Grouping and Replication with Locality-Aware Routing), a lossless co-optimization framework for SMoE inference. GRACE-MoE combines expert grouping to reduce communication, dynamic replication to correct load skew, and locality-aware routing to resolve replica selection. To support this coordinated optimization in multi-node settings, GRACE-MoE adopts a hierarchical sparse communication design that reduces cross-node traffic while implicitly aligning execution across nodes, thereby mitigating synchronization overhead. Experiments across diverse models and multi-node, multi-GPU environments show that GRACE-MoE reduces end-to-end inference latency, achieving up to 4.66x speedup over existing systems. The code will be released upon acceptance.
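To make the replica-selection problem mentioned in the abstract concrete, the following is a minimal sketch of one possible locality-aware policy. It is not GRACE-MoE's actual algorithm, which the abstract does not specify. The `Device` type, the `pick_replica` function, the three locality tiers (same GPU, same node, remote node), and the load-based tie-break are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Hypothetical device identifier: (node index, GPU index within the node)."""
    node: int
    gpu: int


def pick_replica(source, replicas, load):
    """Choose which replica of an expert should serve a token held on `source`.

    Assumed policy (illustrative only, not taken from the paper):
      1. prefer a replica on the same GPU (no communication),
      2. then a replica on the same node (intra-node traffic only),
      3. otherwise a replica on a remote node (cross-node traffic).
    Within a tier, the replica with the lowest current load wins.
    """
    def cost(dev):
        if dev == source:
            tier = 0
        elif dev.node == source.node:
            tier = 1
        else:
            tier = 2
        # Node and GPU indices are appended only to make ties deterministic.
        return (tier, load.get(dev, 0), dev.node, dev.gpu)

    return min(replicas, key=cost)


if __name__ == "__main__":
    src = Device(node=0, gpu=0)
    replicas = [Device(1, 0), Device(0, 1)]
    load = {Device(1, 0): 0, Device(0, 1): 5}
    # The same-node replica is chosen even though the remote one is idle.
    print(pick_replica(src, replicas, load))
```

Under this assumed policy, locality strictly dominates load, so a busy same-node replica always beats an idle remote one. A real system would likely weigh the two against each other, which is the tension between communication and load balance that the abstract describes.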