Speculative decoding improves the inference efficiency of large language models (LLMs) by having a smaller draft model propose tokens that a larger target model then reviews. However, drafting in speculative decoding has two inefficiencies: the draft model itself generates tokens autoregressively, which is slow, and it spends the same amount of time on every token regardless of the token's importance. Together, these lead to suboptimal performance. To address this issue, we introduce Cascade Speculative Drafting (CS. Drafting), a novel approach that employs two types of cascades. The Vertical Cascade eliminates autoregressive generation from neural models. The Horizontal Cascade allocates drafting time efficiently, and its optimality is supported by our theoretical analysis. Combining both cascades, our CS. Drafting algorithm achieves up to 72 percent additional speedup over speculative decoding in our experiments while preserving the same output distribution.
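The claim that the output distribution is unchanged rests on the verification step that speculative decoding applies to each drafted token. The sketch below illustrates the standard accept-or-resample rule from the speculative sampling literature for a single token position; it does not implement the Vertical or Horizontal Cascade. The function names and the toy distributions `p` (target) and `q` (draft) are assumptions made for illustration and do not come from the paper.

```python
import random


def residual(p, q):
    """Normalized max(p - q, 0): the distribution used after a rejection."""
    diff = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    total = sum(diff)
    # If p == q everywhere, a rejection never occurs; fall back to p.
    return [d / total for d in diff] if total > 0 else list(p)


def verify_draft_token(p, q, token, rng=random):
    """Review one drafted token.

    `token` is assumed to have been sampled from q, so q[token] > 0.
    It is accepted with probability min(1, p[token] / q[token]);
    otherwise a replacement is sampled from the residual distribution.
    Returns (emitted_token, was_accepted).
    """
    if rng.random() < min(1.0, p[token] / q[token]):
        return token, True
    weights = residual(p, q)
    return rng.choices(range(len(p)), weights=weights)[0], False


def exact_output_distribution(p, q):
    """Exact distribution of the emitted token under draft-then-verify."""
    # Probability of drafting token i and accepting it: q_i * min(1, p_i / q_i).
    accept = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_mass = 1.0 - sum(accept)
    res = residual(p, q)
    return [a + reject_mass * r for a, r in zip(accept, res)]


if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]  # hypothetical target-model distribution
    q = [0.2, 0.2, 0.6]  # hypothetical draft-model distribution
    print(exact_output_distribution(p, q))
```

Under these assumptions, `exact_output_distribution(p, q)` recovers `p` up to floating-point error for any draft distribution `q` whose support covers the drafted tokens. This is why a drafting procedure can be made faster, as the cascades aim to do, without altering what the target model would have produced.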