Speculative decoding accelerates large language model (LLM) inference by using a small draft model to generate candidate tokens for a larger target model to verify. The efficacy of this technique hinges on the trade-off between the time spent drafting candidates and the time spent verifying them. However, current state-of-the-art methods rely on static time allocation, while recent dynamic approaches optimize proxy metrics such as acceptance length, often neglecting the true time cost and treating the drafting and verification phases in isolation. To address these limitations, we introduce Learning to Draft (LTD), a novel method that directly optimizes the throughput of each draft-and-verify cycle. We formulate the problem as a reinforcement learning environment and train two co-adaptive policies that dynamically coordinate the drafting and verification phases. This encourages the policies to adapt to each other and explicitly maximize decoding efficiency. We conducted extensive evaluations on five diverse LLMs and four distinct tasks. Our results show that LTD achieves speedup ratios ranging from 2.24x to 4.32x, outperforming the state-of-the-art method Eagle3 by up to 36.4%.
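The draft-and-verify cycle described above can be sketched in a few lines. The following is a minimal illustration of one greedy cycle, not the paper's method: `draft_model` and `target_model` are hypothetical stand-ins that map a token context to a single next token, and the verification loop here checks tokens one at a time, whereas a real system verifies all k draft positions in one batched forward pass of the target model (and production methods typically use probabilistic rejection sampling rather than exact greedy matching).

```python
# Minimal sketch of one greedy draft-and-verify cycle in speculative decoding.
# draft_model / target_model are hypothetical stand-ins: each maps a list of
# tokens (the context) to its predicted next token.

def speculative_decode_step(context, draft_model, target_model, k):
    """Run one cycle: draft k candidate tokens with the small model, then
    accept the longest prefix the target model agrees with. On the first
    mismatch, substitute the target's own token; if all k drafts are
    accepted, append one bonus token from the target."""
    # Drafting phase: the small model proposes k tokens autoregressively.
    draft = []
    ctx = list(context)
    for _ in range(k):
        tok = draft_model(ctx)
        draft.append(tok)
        ctx.append(tok)

    # Verification phase: in a real system this is a single batched
    # forward pass of the target model over all k draft positions.
    accepted = []
    ctx = list(context)
    for tok in draft:
        target_tok = target_model(ctx)
        if target_tok != tok:
            # First mismatch: keep the target's token and stop.
            accepted.append(target_tok)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # All k drafts accepted: append one extra token from the target,
    # so every cycle advances the sequence by at least one token.
    accepted.append(target_model(ctx))
    return accepted
```

The key efficiency lever, and the quantity LTD's policies trade off, is the draft length k: drafting more tokens costs extra draft-model time but can amortize one expensive target verification over many accepted tokens, while drafting too many wastes time on tokens the target rejects.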