The Mixture-of-Experts (MoE) architecture has been widely adopted in large language models (LLMs) to reduce computation cost through model sparsity. Employing speculative decoding (SD) can further accelerate MoE inference by drafting multiple tokens per step and verifying them in parallel. However, combining MoE with SD inflates GPU memory usage and aggravates CPU-GPU bandwidth contention during multi-token verification. Existing MoE offloading systems are SD-agnostic and do not address this bottleneck. We present SP-MoE, the first SD-aware expert-offloading and compute-communication pipelining framework. SP-MoE introduces: (1) speculative expert prefetching, which exploits the structural correspondence between the draft and target models to prefetch likely experts ahead of verification; (2) a cutoff-layer policy that bounds per-layer prefetch depth based on empirical profiles and an analytical latency model, guaranteeing just-in-time availability without overfetching; and (3) a pipelined runtime with asynchronous prefetch threads and batched I/O to hide loading latency. Extensive experiments demonstrate that SP-MoE achieves a 1.07-3.5x speedup in time per output token (TPOT) over state-of-the-art methods across diverse datasets, environments, and MoE-based models.
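To make the mechanism behind contributions (1)-(3) concrete, the following is a minimal sketch of SD-aware speculative expert prefetching. It assumes the draft and target models share layer structure, so the draft model's router scores at a layer predict which target experts verification will need at that layer; a background thread copies those experts ahead of time, and a cutoff layer bounds how deep prefetching runs. All class and method names here are illustrative, not SP-MoE's actual API.

```python
import threading
import queue

class ExpertPrefetcher:
    """Hypothetical sketch: prefetch likely target-model experts from host
    memory into a device-side cache, guided by draft-model routing."""

    def __init__(self, cpu_experts, cutoff_layer, top_k=2):
        self.cpu_experts = cpu_experts    # {(layer, expert_id): weights} on host
        self.gpu_cache = {}               # experts already resident on device
        self.cutoff_layer = cutoff_layer  # bound on per-layer prefetch depth
        self.top_k = top_k
        self.requests = queue.Queue()
        # Asynchronous prefetch thread, mirroring the pipelined runtime idea.
        self.worker = threading.Thread(target=self._loop, daemon=True)
        self.worker.start()

    def _loop(self):
        while True:
            item = self.requests.get()
            if item is None:
                break
            layer, expert_ids = item
            # Batched "I/O": load all requested experts for this layer at once
            # (a real system would issue one batched host-to-device transfer).
            for eid in expert_ids:
                self.gpu_cache[(layer, eid)] = self.cpu_experts[(layer, eid)]
            self.requests.task_done()

    def prefetch_from_draft(self, layer, draft_router_scores):
        """Request the draft model's top-k experts for this layer."""
        # Cutoff-layer policy: skip layers too deep to load in time.
        if layer > self.cutoff_layer:
            return
        top = sorted(draft_router_scores, key=draft_router_scores.get,
                     reverse=True)[: self.top_k]
        self.requests.put((layer, top))

    def wait(self):
        self.requests.join()


# Usage: draft routing at layer 0 favors experts 1 and 3, so those are
# prefetched before the target model's verification pass reaches layer 0.
cpu = {(0, e): f"weights_{e}" for e in range(4)}
pf = ExpertPrefetcher(cpu, cutoff_layer=0, top_k=2)
pf.prefetch_from_draft(0, {0: 0.10, 1: 0.70, 2: 0.05, 3: 0.15})
pf.wait()
print(sorted(pf.gpu_cache))  # prints [(0, 1), (0, 3)]
```

In a real deployment the dictionary copy would be an asynchronous host-to-device transfer overlapped with the draft model's forward pass, which is what hides the loading latency.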