Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting

Speculative decoding has demonstrated its effectiveness in accelerating the inference of large language models while maintaining a consistent sampling distribution. However, the conventional approach of training a separate draft model to achieve a satisfactory token acceptance rate can be costly. Drawing inspiration from early exiting, we propose a novel self-speculative decoding framework \emph{Kangaroo}, which uses a fixed shallow sub-network as a self-draft model, with the remaining layers serving as the larger target model. We train a lightweight and efficient adapter module on top of the sub-network to bridge the gap between the sub-network and the full model's representation ability. It is noteworthy that the inference latency of the self-draft model may no longer be negligible compared to the large model, necessitating strategies to increase the token acceptance rate while minimizing the drafting steps of the small model. To address this challenge, we introduce an additional early exiting mechanism for generating draft tokens. Specifically, we halt the small model's subsequent prediction during the drafting phase once the confidence level for the current token falls below a certain threshold. Extensive experiments on the Spec-Bench demonstrate the effectiveness of Kangaroo. Under single-sequence verification, Kangaroo achieves speedups up to $1.68\times$ on Spec-Bench, outperforming Medusa-1 with 88.7\% fewer additional parameters (67M compared to 591M). The code for Kangaroo is available at https://github.com/Equationliu/Kangaroo.

翻译：推测解码在保持采样分布一致性的同时，已被证实能有效加速大型语言模型的推理过程。然而，传统方法需要训练独立的草稿模型以实现可观的令牌接受率，这通常代价高昂。受早退思想启发，我们提出一种新颖的自推测解码框架——Kangaroo，通过固定浅层子网络作为自草稿模型，而剩余层作为更大的目标模型。我们在子网络上训练一个轻量高效的适配器模块，以弥合子网络与完整模型表示能力之间的差距。值得注意的是，自草稿模型的推理延迟相较于大型模型已不可忽视，因此需要制定策略在提升令牌接受率的同时减少小型模型的草稿生成步骤。为应对这一挑战，我们引入了额外的早退机制来生成草稿令牌。具体而言，在草稿生成阶段，一旦当前令牌的置信度低于预设阈值，我们即终止小型模型的后续预测。在Spec-Bench上的大量实验证明了Kangaroo的有效性。在单序列验证条件下，Kangaroo在Spec-Bench上实现了最高$1.68\times$的加速比，同时以较Medusa-1减少88.7%的额外参数（6700万对比5.91亿）取得更优性能。Kangaroo代码已开源至https://github.com/Equationliu/Kangaroo。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【NeurIPS2021】用于文本图表示学习的 GNN 嵌套 Transformer 模型：GraphFormers

专知会员服务

46+阅读 · 2021年11月24日

【亚马逊-WWW2020】不解析,生成!用于面向任务的语义分析的序列到序列体系结构，Don't Parse, Generate! A Sequence to Sequence Architecture for Task-Oriented Semantic Parsing

专知会员服务

15+阅读 · 2020年2月1日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日