MoE (Mixture of Experts) has prevailed as a neural architecture that scales modern transformer-based LLMs (Large Language Models) to unprecedented sizes. Nevertheless, large MoEs' heavy demands on compute, memory capacity, and memory bandwidth make scalable serving a fundamental challenge, and efficient parallel inference has become a requisite for attaining adequate throughput under latency constraints. DeepSpeed-MoE, a state-of-the-art MoE inference framework, adopts a 3D-parallel paradigm comprising EP (Expert Parallelism), TP (Tensor Parallelism), and DP (Data Parallelism). However, our analysis shows that DeepSpeed-MoE's inference efficiency is largely bottlenecked by EP, which is implemented with costly all-to-all collectives to route token activations. Our work aims to boost DeepSpeed-MoE by strategically reducing EP's communication overhead with a technique named Speculative MoE. Speculative MoE comprises two speculative parallelization schemes, speculative token shuffling and speculative expert grouping, which predict outstanding tokens' expert routing paths and pre-schedule tokens and experts across devices to losslessly trim EP's communication volume. Beyond DeepSpeed-MoE, we also build Speculative MoE into SGLang, a prevailing MoE inference engine. Experiments show that Speculative MoE significantly boosts state-of-the-art MoE inference frameworks on both fast homogeneous and slow heterogeneous interconnects.
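To make the communication argument concrete, the following is a minimal toy sketch (not the paper's implementation) of why EP's all-to-all is costly and how speculative token shuffling can trim it: under EP, a token must cross the interconnect whenever its routed expert lives on another device, whereas pre-placing each token on the device of its *predicted* expert makes correctly predicted tokens local. The device count, expert placement, router, and predictor accuracy below are all illustrative assumptions.

```python
import random

random.seed(0)
NUM_DEVICES, EXPERTS_PER_DEVICE = 4, 2
NUM_EXPERTS = NUM_DEVICES * EXPERTS_PER_DEVICE


def home_device(expert_id: int) -> int:
    # Toy EP placement: experts sharded contiguously across devices.
    return expert_id // EXPERTS_PER_DEVICE


def cross_device_hops(token_device, routed_expert) -> int:
    # A token incurs interconnect traffic only when its routed expert
    # is remote (count one dispatch hop + one combine hop).
    return sum(2 for dev, exp in zip(token_device, routed_expert)
               if home_device(exp) != dev)


tokens = 1024
true_route = [random.randrange(NUM_EXPERTS) for _ in range(tokens)]

# Baseline EP: tokens stay wherever their sequence shard happens to live,
# so roughly (NUM_DEVICES - 1) / NUM_DEVICES of them are routed remotely.
baseline_placement = [i % NUM_DEVICES for i in range(tokens)]

# Speculation: a predictor guesses each token's expert before the MoE layer
# (80% accuracy is an assumption); tokens are pre-shuffled to the guessed
# expert's home device, so correct guesses need no all-to-all traffic.
accuracy = 0.8
predicted = [e if random.random() < accuracy else random.randrange(NUM_EXPERTS)
             for e in true_route]
spec_placement = [home_device(p) for p in predicted]

base = cross_device_hops(baseline_placement, true_route)
spec = cross_device_hops(spec_placement, true_route)
print("baseline cross-device hops:   ", base)
print("speculative cross-device hops:", spec)
```

With the toy numbers above, the baseline sends roughly three quarters of tokens off-device, while speculation pays remote traffic only for mispredicted tokens; the scheme is lossless because mispredicted tokens are still routed to their true experts, just over the slower path.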