Speculative decoding (SD), in which an extra draft model first proposes multiple \textit{draft} tokens and the original target model then verifies them in parallel, has proven effective for accelerating LLM inference. However, existing SD methods suffer from the mutual waiting problem: the target model sits idle while the draft model is \textit{guessing} tokens, and vice versa. This problem is directly incurred by the asynchronous execution of the draft model and the target model, and is exacerbated by the fixed draft length in speculative decoding. To address these challenges, we propose a conceptually simple, flexible, and general framework to boost speculative decoding, namely \textbf{P}arallel sp\textbf{E}culative decoding with \textbf{A}daptive d\textbf{R}aft \textbf{L}ength (PEARL). Specifically, PEARL proposes \textit{pre-verify}, which verifies the first draft token in advance during the drafting phase, and \textit{post-verify}, which generates more draft tokens during the verification phase. By applying these two strategies, PEARL parallelizes the drafting phase and the verification phase and achieves an adaptive draft length for different scenarios, which effectively alleviates the mutual waiting problem. Moreover, we theoretically demonstrate that the mean number of accepted tokens of PEARL exceeds that of existing \textit{draft-then-verify} methods. Experiments on various text generation benchmarks demonstrate the effectiveness of \name, which achieves speedups of up to \textbf{3.79$\times$} and \textbf{1.52$\times$} over auto-regressive decoding and vanilla speculative decoding, respectively.
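To make the \textit{draft-then-verify} baseline and its mutual waiting concrete, the following is a minimal sketch of vanilla speculative decoding with greedy verification. It is not PEARL itself and not the authors' implementation. The names `target_model`, `draft_model`, `autoregressive`, and `speculative_decode`, the toy arithmetic "models", and the fixed draft length `gamma` are all assumptions made for illustration. The comments mark where each model would sit idle, which is the gap that pre-verify and post-verify are described as closing.

```python
from typing import Callable, List, Tuple

Token = int
Model = Callable[[List[Token]], Token]  # greedy next-token function (toy stand-in)


def target_model(prefix: List[Token]) -> Token:
    """Hypothetical 'large' model: a deterministic toy rule, not a real LLM."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_model(prefix: List[Token]) -> Token:
    """Hypothetical 'small' model: agrees with the target except when the
    prefix length is a multiple of 4 (an arbitrary choice to force rejections)."""
    guess = target_model(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 50


def autoregressive(model: Model, prompt: List[Token], n_new: int) -> List[Token]:
    """Plain one-token-at-a-time decoding, used as the reference output."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq


def speculative_decode(draft: Model, target: Model, prompt: List[Token],
                       n_new: int, gamma: int = 4) -> Tuple[List[Token], int]:
    """Vanilla draft-then-verify with a fixed draft length `gamma`.
    Returns the decoded sequence and the number of draft/verify rounds."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    rounds = 0
    while len(seq) < goal:
        rounds += 1
        # Drafting phase: the target model is idle while gamma tokens are guessed.
        drafts: List[Token] = []
        for _ in range(gamma):
            drafts.append(draft(seq + drafts))
        # Verification phase: the draft model is idle. A real system would score
        # all positions in one parallel forward pass; this toy loops instead.
        accepted: List[Token] = []
        for tok in drafts:
            expected = target(seq + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # replace the first wrong draft token
                break
        else:
            # Every draft token was accepted, so the target adds one more token.
            accepted.append(target(seq + accepted))
        seq.extend(accepted)
    return seq[:goal], rounds


if __name__ == "__main__":
    prompt = [1, 2, 3]
    out, rounds = speculative_decode(draft_model, target_model, prompt, n_new=20)
    assert out == autoregressive(target_model, prompt, 20)
    print(f"decoded 20 tokens in {rounds} draft/verify rounds")
```

Under greedy verification the output matches the target model's own auto-regressive output, so any gain comes only from needing fewer target rounds. In this sketch the two phases run strictly one after the other, and `gamma` stays fixed even when the very first draft token is rejected.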