Speculative decoding is a widely used technique to speed up inference for Large Language Models (LLMs) without sacrificing quality. During inference, speculative decoding uses a smaller draft model to generate speculative tokens and then uses the target LLM to verify those draft tokens. The speedup provided by speculative decoding depends heavily on the choice of the draft model. In this work, we perform a detailed study comprising over 350 experiments with LLaMA-65B and OPT-66B using speculative decoding and delineate the factors that affect the performance gain it provides. Our experiments indicate that the performance of speculative decoding depends heavily on the latency of the draft model, and that a draft model's capability in language modeling does not correlate strongly with its performance in speculative decoding. Based on these insights, we explore a new design space for draft models and design hardware-efficient draft models for speculative decoding. Our newly designed draft model for LLaMA-65B provides 60% higher throughput than existing draft models and generalizes further to the LLaMA-2 model family and supervised fine-tuned models.