The Mixture-of-Experts (MoE) architecture has been widely adopted in large language models (LLMs) to reduce computation cost through model sparsity. Employing speculative decoding (SD) can further accelerate MoE inference by drafting multiple tokens per step and verifying them in parallel. However, combining MoE with SD inflates GPU memory usage and aggravates CPU-GPU bandwidth contention during multi-token verification. Existing MoE offloading systems are SD-agnostic and do not address this bottleneck. We present SP-MoE, the first SD-aware expert-offloading and compute-communication pipelining framework. SP-MoE introduces: (1) speculative expert prefetching, which exploits the structural correspondence between the draft and target models to prefetch likely experts ahead of verification; (2) a cutoff-layer policy that bounds per-layer prefetch depth based on empirical profiles and an analytical latency model, guaranteeing just-in-time availability without overfetching; and (3) a pipelined runtime with asynchronous prefetch threads and batched I/O to hide loading latency. Extensive experiments demonstrate that SP-MoE achieves a 1.07-3.5x speedup in time per output token (TPOT) over state-of-the-art methods across diverse datasets, environments, and MoE-based models.
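To make the mechanism behind contributions (1)-(3) concrete, the following is a minimal sketch of SD-aware speculative expert prefetching. It assumes the draft and target models share layer structure, so the draft model's router scores at a layer predict which target experts verification will need at that layer; a background thread copies those experts ahead of time, and a cutoff layer bounds how deep prefetching runs. All class and method names here are illustrative, not SP-MoE's actual API.

```python
import threading
import queue

class ExpertPrefetcher:
    """Hypothetical sketch: prefetch likely target-model experts from host
    memory into a device-side cache, guided by draft-model routing."""

    def __init__(self, cpu_experts, cutoff_layer, top_k=2):
        self.cpu_experts = cpu_experts    # {(layer, expert_id): weights} on host
        self.gpu_cache = {}               # experts already resident on device
        self.cutoff_layer = cutoff_layer  # bound on per-layer prefetch depth
        self.top_k = top_k
        self.requests = queue.Queue()
        # Asynchronous prefetch thread, mirroring the pipelined runtime idea.
        self.worker = threading.Thread(target=self._loop, daemon=True)
        self.worker.start()

    def _loop(self):
        while True:
            item = self.requests.get()
            if item is None:
                break
            layer, expert_ids = item
            # Batched "I/O": load all requested experts for this layer at once
            # (a real system would issue one batched host-to-device transfer).
            for eid in expert_ids:
                self.gpu_cache[(layer, eid)] = self.cpu_experts[(layer, eid)]
            self.requests.task_done()

    def prefetch_from_draft(self, layer, draft_router_scores):
        """Request the draft model's top-k experts for this layer."""
        # Cutoff-layer policy: skip layers too deep to load in time.
        if layer > self.cutoff_layer:
            return
        top = sorted(draft_router_scores, key=draft_router_scores.get,
                     reverse=True)[: self.top_k]
        self.requests.put((layer, top))

    def wait(self):
        self.requests.join()


# Usage: draft routing at layer 0 favors experts 1 and 3, so those are
# prefetched before the target model's verification pass reaches layer 0.
cpu = {(0, e): f"weights_{e}" for e in range(4)}
pf = ExpertPrefetcher(cpu, cutoff_layer=0, top_k=2)
pf.prefetch_from_draft(0, {0: 0.10, 1: 0.70, 2: 0.05, 3: 0.15})
pf.wait()
print(sorted(pf.gpu_cache))  # prints [(0, 1), (0, 3)]
```

In a real deployment the dictionary copy would be an asynchronous host-to-device transfer overlapped with the draft model's forward pass, which is what hides the loading latency.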