Speculative decoding is a widely used technique for speeding up inference for Large Language Models (LLMs) without changing their output. During inference, a smaller draft model generates speculative tokens, and the target LLM then verifies those draft tokens. The speedup from speculative decoding depends heavily on the choice of draft model. It has been widely suggested that, to achieve the highest throughput, one should select a draft model whose generated tokens have a high probability of being accepted by the target LLM. Our experiments indicate the contrary: throughput diminishes as the probability that generated tokens are accepted by the target model increases. To understand this phenomenon, we perform extensive experiments to characterize the factors that affect speculative decoding, how those factors interact, and how they affect the speedup. Based on our experiments, we describe an analytical model that can be used to select the right draft model for a given workload. Further, using our insights, we design a new draft model for LLaMA-65B that can provide 30% higher throughput than existing draft models.
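To make the draft-then-verify loop described above concrete, the following is a minimal toy sketch, not the implementation evaluated in this work. It assumes greedy decoding, so verification reduces to an equality check (production systems typically verify with rejection sampling over token probabilities). The functions `target_next` and `draft_next`, the window size `k`, and the pass counter are hypothetical stand-ins introduced only for illustration.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_next(tokens: List[int]) -> int:
    """Hypothetical stand-in for the target LLM: a deterministic toy rule."""
    return (sum(tokens) * 31 + len(tokens)) % 50


def draft_next(tokens: List[int]) -> int:
    """Hypothetical draft model: agrees with the target except when the
    context length is a multiple of 3 (purely to force some rejections)."""
    guess = target_next(tokens)
    return guess if len(tokens) % 3 else (guess + 1) % 50


def greedy_decode(next_token: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Plain autoregressive decoding: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(next_token(tokens))
    return tokens


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Greedy speculative decoding. Returns the tokens and the number of
    verification rounds (each would be one batched target pass in practice)."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(tokens) < goal:
        # 1. The draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))

        # 2. The target verifies the proposal. Here it is called per position;
        #    a real system scores all k positions in a single forward pass.
        target_passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if tok != expected:
                accepted.append(expected)  # replace the first mismatch, stop
                break
            accepted.append(tok)
        else:
            # Every draft token matched: the target contributes one extra token.
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal], target_passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    spec_tokens, passes = speculative_decode(target_next, draft_next, prompt, n_new=20)
    assert spec_tokens == greedy_decode(target_next, prompt, n_new=20)
    print(f"target verification rounds: {passes} (vs. 20 without speculation)")
```

Because every emitted token equals what the target would have produced for the same context, the output matches plain greedy decoding with the target alone. The sketch counts only verification rounds and ignores the cost of running the draft model, which is one of the factors a draft model choice has to account for.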