Speculative decoding has become the standard approach for accelerating Large Language Model (LLM) inference. It exploits a lossless draft-then-verify procedure to circumvent the latency of autoregressive decoding, achieving impressive speed-ups. Yet, current speculative decoding approaches remain limited by two fundamental bottlenecks: (1) the autoregressive dependency during drafting, which limits parallelism, and (2) frequent rejections of draft tokens caused by misalignment between the draft and verify models. This paper proposes SpecDiff-2, a novel framework that jointly addresses these two bottlenecks. It leverages discrete diffusion as a non-autoregressive drafter to address bottleneck (1) and develops novel techniques to calibrate discrete diffusion drafters with autoregressive verifiers, addressing bottleneck (2). Experimental results on a comprehensive benchmark suite show that SpecDiff-2 sets a new state-of-the-art across reasoning, coding, and mathematical benchmarks, improving average tokens-per-second by up to 55% over previous baselines and achieving up to a 5.5x average speed-up over standard decoding, without any loss of accuracy.
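To make the draft-then-verify procedure referenced above concrete, the sketch below illustrates the standard rejection-sampling verification step commonly used to make speculative decoding lossless; it is a minimal, generic illustration, not SpecDiff-2's implementation, and the function name `verify_draft` and its array-based interface are assumptions introduced here for exposition.

```python
import numpy as np

def verify_draft(draft_tokens, q_probs, p_probs, rng=None):
    """Accept/reject a block of drafted tokens against the verifier.

    draft_tokens : list of k token ids proposed by the drafter.
    q_probs      : k drafter distributions over the vocabulary (one per draft position).
    p_probs      : k + 1 verifier distributions; the extra one scores the position
                   after the last draft and supplies the "bonus" token.
    Returns the accepted prefix plus one token sampled from the verifier, so the
    overall output distribution matches the verifier exactly (losslessness).
    """
    rng = rng or np.random.default_rng()
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_probs[i][tok], q_probs[i][tok]
        if rng.random() < min(1.0, p / max(q, 1e-12)):
            out.append(tok)                       # verifier agrees often enough: accept
            continue
        # First rejection: resample from the normalized residual (p - q)+ and stop.
        residual = np.clip(p_probs[i] - q_probs[i], 0.0, None)
        out.append(int(rng.choice(len(residual), p=residual / residual.sum())))
        return out
    # All drafts accepted: draw one extra token from the verifier's next distribution.
    out.append(int(rng.choice(len(p_probs[-1]), p=p_probs[-1])))
    return out
```

The fraction of drafted tokens that survive this test is the acceptance rate; bottleneck (2) above corresponds to this rate being low when the drafter and verifier are misaligned, which is what SpecDiff-2's calibration techniques target.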