Speculative decoding accelerates inference for large language models (LLMs) by using a smaller draft model to predict the target model's outputs. Its efficacy is limited, however, when the draft model predicts poorly, particularly when the input text is diverse and the capability gap between the draft and target models is large. We introduce online speculative decoding (OSD) to address this challenge. The main idea is to continually update one or more draft models on observed user query data, using the abundant excess computational power in an LLM serving cluster. Because LLM inference is memory-bound, the surplus compute in a typical serving cluster can be repurposed for online retraining of the draft models, which makes the training cost-neutral. Since the query distribution of an LLM service is relatively simple, retraining on it enables the draft model to predict the target model's outputs more accurately, particularly on inputs drawn from that distribution. As the draft model evolves online, it tracks the query distribution in real time, mitigating distribution shift. We develop a prototype of online speculative decoding based on online knowledge distillation and evaluate it on both synthetic and real query data with several popular LLMs. The results show a substantial increase in the token acceptance rate, by 0.1 to 0.65, which translates into a 1.22x to 3.06x latency reduction.
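To illustrate why a higher token acceptance rate shortens latency, the sketch below uses the simplified analytical model that is common in the speculative decoding literature. It is not necessarily the model behind the numbers reported here. The assumptions are: each drafted token is accepted independently with probability `alpha`, the draft model proposes `k` tokens per verification step, and one draft forward pass costs a fraction `c` of one target forward pass. The function names and the example values for `k` and `c` are hypothetical.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens produced per target-model verification pass.

    Assumes each of the k drafted tokens is accepted independently with
    probability alpha, and that the target model always contributes one
    extra token (the correction or the bonus token).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        # Every drafted token is accepted, plus the bonus token.
        return float(k + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, k: int, c: float) -> float:
    """Expected speedup over plain autoregressive decoding.

    One step costs k draft passes (c each, in units of a target pass)
    plus one target pass, and yields expected_tokens_per_step tokens.
    """
    if c < 0.0:
        raise ValueError("c must be non-negative")
    return expected_tokens_per_step(alpha, k) / (c * k + 1.0)


if __name__ == "__main__":
    # Hypothetical settings: 5 drafted tokens, draft pass at 5% of target cost.
    for alpha in (0.3, 0.5, 0.7, 0.9):
        print(f"alpha={alpha:.1f}  speedup={expected_speedup(alpha, k=5, c=0.05):.2f}x")
```

Under these assumptions the speedup grows monotonically with `alpha`, which is why raising the acceptance rate through online retraining of the draft model shows up directly as lower latency.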