DistillSpec: Improving Speculative Decoding via Knowledge Distillation

Speculative decoding (SD) accelerates large language model inference by employing a faster draft model for generating multiple tokens, which are then verified in parallel by the larger target model, resulting in the text generated according to the target model distribution. However, identifying a compact draft model that is well-aligned with the target model is challenging. To tackle this issue, we propose DistillSpec that uses knowledge distillation to better align the draft model with the target model, before applying SD. DistillSpec makes two key design choices, which we demonstrate via systematic study to be crucial to improving the draft and target alignment: utilizing on-policy data generation from the draft model, and tailoring the divergence function to the task and decoding strategy. Notably, DistillSpec yields impressive 10 - 45% speedups over standard SD on a range of standard benchmarks, using both greedy and non-greedy sampling. Furthermore, we combine DistillSpec with lossy SD to achieve fine-grained control over the latency vs. task performance trade-off. Finally, in practical scenarios with models of varying sizes, first using distillation to boost the performance of the target model and then applying DistillSpec to train a well-aligned draft model can reduce decoding latency by 6-10x with minimal performance drop, compared to standard decoding without distillation.

翻译：推测解码（SD）通过使用更快的草稿模型生成多个令牌，然后由更大的目标模型并行验证，从而加速大型语言模型的推理，使生成的文本符合目标模型分布。然而，找到与目标模型良好对齐的紧凑型草稿模型极具挑战性。为解决此问题，我们提出DistillSpec，在应用SD之前使用知识蒸馏使草稿模型与目标模型更好地对齐。DistillSpec做出两个关键设计选择，我们通过系统性研究证明这些选择对改进草稿与目标对齐至关重要：利用来自草稿模型的在线数据生成，以及根据任务和解码策略定制散度函数。值得注意的是，在多种标准基准测试中，使用贪婪采样和非贪婪采样时，DistillSpec相较于标准SD实现了10%-45%的显著加速。此外，我们将DistillSpec与有损SD相结合，以实现对延迟与任务性能权衡的精细控制。最后，在模型规模各异的实际场景中，先使用蒸馏提升目标模型性能，再应用DistillSpec训练良好对齐的草稿模型，与未使用蒸馏的标准解码相比，可将解码延迟降低6-10倍，同时性能下降极小。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【CVPR 2022】一个完全无监督的框架，从噪声和部分测量中学习图像，Robust Equivariant Imaging: a fully unsupervised framework for learning to image

专知会员服务

25+阅读 · 2022年3月3日

【NeurIPS2021】用于文本图表示学习的 GNN 嵌套 Transformer 模型：GraphFormers

专知会员服务

46+阅读 · 2021年11月24日

【亚马逊-WWW2020】不解析,生成!用于面向任务的语义分析的序列到序列体系结构，Don't Parse, Generate! A Sequence to Sequence Architecture for Task-Oriented Semantic Parsing

专知会员服务

15+阅读 · 2020年2月1日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日