MoE (Mixture of Experts) has become a prevailing neural architecture for scaling modern transformer-based LLMs (Large Language Models) to unprecedented sizes. However, large MoE models' enormous demands on compute power, memory capacity, and memory bandwidth make scalable serving a fundamental challenge, and efficient parallel inference has become a prerequisite for attaining adequate throughput under latency constraints. DeepSpeed-MoE, a state-of-the-art MoE inference framework, adopts a 3D-parallel paradigm comprising EP (Expert Parallelism), TP (Tensor Parallelism), and DP (Data Parallelism). However, our analysis shows that DeepSpeed-MoE's inference efficiency is largely bottlenecked by EP, which is implemented with costly all-to-all collectives that route token activations. Our work aims to accelerate DeepSpeed-MoE by strategically reducing EP's communication overhead with a technique named Speculative MoE. Speculative MoE comprises two speculative parallelization schemes, speculative token shuffling and speculative expert grouping, which predict the expert routing paths of outstanding tokens and pre-schedule tokens and experts across devices to losslessly trim EP's communication volume. Beyond DeepSpeed-MoE, we also build Speculative MoE into SGLang, another prevailing MoE inference engine. Experiments show that Speculative MoE can significantly boost state-of-the-art MoE inference frameworks over both fast homogeneous and slow heterogeneous interconnects.
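To make the intuition behind speculative pre-scheduling concrete, the following is a minimal sketch (not the paper's implementation; all names and the toy placement are illustrative assumptions). It counts how many tokens must cross devices during expert-parallel dispatch when tokens stay where the batch split put them, versus when tokens are pre-placed on the device hosting their predicted expert, with one misprediction still paying the cross-device cost.

```python
# Hypothetical sketch: estimate cross-device token traffic in EP's
# all-to-all dispatch, with and without speculative token pre-shuffling.

def cross_device_traffic(token_device, routed_expert, expert_device):
    """Count tokens whose routed expert resides on a different device."""
    return sum(
        1
        for dev, expert in zip(token_device, routed_expert)
        if expert_device[expert] != dev
    )

# Toy setup (assumed): 4 experts placed over 2 devices, 6 tokens.
expert_device = {0: 0, 1: 0, 2: 1, 3: 1}
routed_expert = [0, 2, 2, 1, 3, 3]   # ground-truth expert per token

# Baseline: tokens sit wherever the batch was initially split.
baseline_placement = [0, 0, 0, 1, 1, 1]

# Speculative: a predictor guesses each token's expert ahead of routing,
# so each token is pre-placed on its predicted expert's device; the one
# mispredicted token (index 1) still incurs a cross-device transfer.
predicted_expert = [0, 0, 2, 1, 3, 3]  # token 1 mispredicted
speculative_placement = [expert_device[e] for e in predicted_expert]

print(cross_device_traffic(baseline_placement, routed_expert, expert_device))
print(cross_device_traffic(speculative_placement, routed_expert, expert_device))
```

In this toy run, baseline dispatch moves 3 of 6 tokens across devices while speculative pre-placement moves only the 1 mispredicted token, illustrating how accurate routing prediction shrinks all-to-all volume.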