Speculative decoding improves the efficiency of large language models (LLMs) by having a smaller draft model propose tokens that a larger target model then reviews. However, drafting in speculative decoding has two inefficiencies: it relies on slow autoregressive generation, and it spends the same amount of time on every token regardless of the token's importance. Together, these lead to suboptimal performance. To address them, we introduce Cascade Speculative Drafting (CS. Drafting), a novel approach that employs two types of cascades. The Vertical Cascade eliminates autoregressive generation from neural models. The Horizontal Cascade allocates drafting time efficiently, and our theoretical analysis supports its optimality. Combining both cascades, our CS. Drafting algorithm achieves up to 72 percent additional speedup over speculative decoding in our experiments while preserving the same output distribution.
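As background for the claim that the output distribution is preserved, the following is a minimal sketch of the standard accept-or-resample rule used in speculative decoding to review a single drafted token. It does not implement the Vertical or Horizontal Cascade; the function name `verify_draft_token` and its parameters are hypothetical and chosen only for illustration, and the distributions are assumed to be plain probability lists over a shared vocabulary.

```python
import random


def verify_draft_token(token, p_target, q_draft, rng=random):
    """Review one drafted token (illustrative sketch, hypothetical names).

    token    -- index of the token proposed by the draft model
    p_target -- target-model probabilities over the vocabulary
    q_draft  -- draft-model probabilities over the same vocabulary
    Returns (accepted, final_token).
    """
    p, q = p_target[token], q_draft[token]
    # Accept the drafted token with probability min(1, p / q).
    if q > 0 and rng.random() < min(1.0, p / q):
        return True, token
    # On rejection, resample from the renormalized residual max(0, p - q).
    residual = [max(0.0, pt - qd) for pt, qd in zip(p_target, q_draft)]
    if sum(residual) == 0.0:
        # Both distributions coincide; fall back to sampling from the target.
        residual = list(p_target)
    return False, rng.choices(range(len(residual)), weights=residual)[0]
```

Under these assumptions, if the drafted token is sampled from `q_draft`, the token returned by this rule is distributed according to `p_target`, which is why a faster drafting procedure can change the speed of generation without changing its output distribution.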