Speculative decoding (SD) accelerates large language model inference by using a faster draft model to generate multiple tokens, which the larger target model then verifies in parallel, so that the generated text follows the target model distribution. However, identifying a compact draft model that is well aligned with the target model is challenging. To tackle this issue, we propose DistillSpec, which uses knowledge distillation to better align the draft model with the target model before applying SD. DistillSpec makes two key design choices, which we show through a systematic study to be crucial for improving draft-target alignment: using on-policy data generated by the draft model, and tailoring the divergence function to the task and decoding strategy. Notably, DistillSpec yields impressive 10-45% speedups over standard SD on a range of standard benchmarks, with both greedy and non-greedy sampling. Furthermore, we combine DistillSpec with lossy SD to achieve fine-grained control over the trade-off between latency and task performance. Finally, in practical scenarios with models of varying sizes, first using distillation to boost the performance of the target model and then applying DistillSpec to train a well-aligned draft model can reduce decoding latency by 6-10x with minimal performance drop, compared to standard decoding without distillation.
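To make the link between draft-target alignment and speed concrete, here is a minimal sketch of the accept/reject rule commonly used in speculative sampling, applied to a single token over a toy three-word vocabulary. It is not taken from DistillSpec itself: the function names `speculative_step` and `acceptance_rate` and the two example distributions are illustrative assumptions. The sketch shows that the emitted token is still distributed according to the target model, and that the probability of accepting a draft token equals one minus the total variation distance between the two distributions, which is why a better-aligned draft model gives larger speedups.

```python
import random


def speculative_step(p_target, q_draft, rng):
    """Sample one token with the standard speculative accept/reject rule.

    p_target, q_draft: probability lists over the same toy vocabulary
    (hypothetical stand-ins for the target and draft model outputs).
    Returns (token_index, accepted_flag).
    """
    vocab = range(len(q_draft))
    # The draft model proposes a token x ~ q.
    x = rng.choices(vocab, weights=q_draft, k=1)[0]
    # The target model accepts it with probability min(1, p(x) / q(x)).
    if rng.random() < min(1.0, p_target[x] / q_draft[x]):
        return x, True
    # On rejection, resample from the residual max(p - q, 0);
    # rng.choices normalizes the weights internally.
    residual = [max(p - q, 0.0) for p, q in zip(p_target, q_draft)]
    return rng.choices(vocab, weights=residual, k=1)[0], False


def acceptance_rate(p_target, q_draft):
    """Analytic acceptance probability: sum_x min(p(x), q(x)) = 1 - TVD(p, q)."""
    return sum(min(p, q) for p, q in zip(p_target, q_draft))


if __name__ == "__main__":
    rng = random.Random(0)
    p = [0.6, 0.3, 0.1]  # toy target distribution (assumed for illustration)
    q = [0.4, 0.4, 0.2]  # toy draft distribution (assumed for illustration)
    n = 100_000
    counts = [0] * len(p)
    accepted = 0
    for _ in range(n):
        token, ok = speculative_step(p, q, rng)
        counts[token] += 1
        accepted += ok
    print("empirical output distribution:", [c / n for c in counts])
    print("empirical acceptance rate:", accepted / n)
    print("analytic acceptance rate:", acceptance_rate(p, q))
```

In this toy setting the empirical output frequencies should track `p` rather than `q`, and moving `q` closer to `p` raises the acceptance rate toward 1. This is the quantity that distilling the draft model aims to improve.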