Online Speculative Decoding

Speculative decoding accelerates large language model (LLM) inference by using a smaller draft model to predict the target model's outputs. Its efficacy is limited, however, when the draft model predicts poorly, particularly on diverse text inputs and when there is a significant capability gap between the draft and target models. We introduce online speculative decoding (OSD) to address this challenge. The main idea is to continually update one or more draft models on observed user query data, using the abundant excess computational power in an LLM serving cluster. Because LLM inference is memory-bound, the surplus compute in a typical serving cluster can be repurposed for online retraining of draft models, which makes the training cost-neutral. Because the query distribution of an LLM service is relatively simple, retraining on it enables the draft model to predict the target model's outputs more accurately, particularly on data drawn from that query distribution. As the draft model evolves online, it aligns with the query distribution in real time, mitigating distribution shifts. We develop a prototype of online speculative decoding based on online knowledge distillation and evaluate it using both synthetic and real query data on several popular LLMs. The results show that the token acceptance rate increases substantially, by 0.1 to 0.65, which translates into a 1.22x to 3.06x latency reduction.
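To make the link between token acceptance rate and latency concrete, the sketch below uses the standard speculative decoding analysis rather than anything specific to the OSD prototype. It assumes each drafted token is accepted independently with probability `alpha`, that the draft model proposes `k` tokens per step, and that one draft forward pass costs a fraction `c` of a target forward pass. The function names and parameter values are illustrative assumptions, so the output will not reproduce the measured 1.22x to 3.06x range.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    Assumption: each of the k drafted tokens is accepted independently with
    probability alpha, and every pass yields one extra token from the target
    model (a correction after a rejection, or a bonus token if all k pass).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        # Every drafted token is accepted, plus the bonus token.
        return float(k + 1)
    # Closed form of the geometric sum 1 + alpha + ... + alpha**k.
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, k: int, c: float) -> float:
    """Expected speedup over plain autoregressive decoding.

    One step costs k draft passes (c each, in units of a target pass)
    plus one target pass, and emits expected_tokens_per_step tokens.
    """
    if c < 0.0:
        raise ValueError("c must be non-negative")
    return expected_tokens_per_step(alpha, k) / (c * k + 1.0)


if __name__ == "__main__":
    # Hypothetical settings: 5 drafted tokens, draft pass at 5% of target cost.
    for alpha in (0.3, 0.5, 0.7, 0.9):
        print(f"alpha={alpha:.1f}  speedup={expected_speedup(alpha, 5, 0.05):.2f}x")
```

Under these assumptions the speedup grows faster than linearly in `alpha`, which is why raising the acceptance rate through online distillation can pay off even when the draft model itself is unchanged in size.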

