SpecDiff-2: Scaling Diffusion Drafter Alignment For Faster Speculative Decoding

Speculative decoding has become the standard approach for accelerating Large Language Model (LLM) inference. It exploits a lossless draft-then-verify procedure to circumvent the latency of autoregressive decoding, achieving impressive speed-ups. Yet current speculative decoding approaches remain limited by two fundamental bottlenecks: (1) the autoregressive dependency during drafting, which limits parallelism, and (2) frequent rejections of draft tokens caused by misalignment between the draft and verify models. This paper proposes SpecDiff-2, a novel framework that jointly addresses these two bottlenecks. It leverages discrete diffusion as a non-autoregressive drafter to address bottleneck (1) and develops novel techniques to calibrate discrete diffusion drafters with autoregressive verifiers, addressing bottleneck (2). Experimental results on a comprehensive benchmark suite show that SpecDiff-2 sets a new state of the art across reasoning, coding, and mathematical benchmarks, improving tokens-per-second by up to +55% on average over previous baselines and obtaining up to a 5.5x average speed-up over standard decoding, without any loss of accuracy.

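To make the "lossless draft-then-verify procedure" concrete, the following is a minimal sketch of the standard speculative-sampling acceptance rule from the general speculative decoding literature. It is not taken from SpecDiff-2 itself, and it says nothing about how the diffusion drafter is trained or aligned. The function names `verify_draft` and `sample`, and the representation of distributions as plain Python lists over a toy vocabulary, are assumptions made for illustration. The sketch assumes the drafter supplies a distribution for every draft position up front, which is what a non-autoregressive drafter would make possible in a single parallel pass.

```python
import random


def sample(weights, rng):
    """Draw an index with probability proportional to `weights`."""
    return rng.choices(range(len(weights)), weights=weights)[0]


def verify_draft(draft_tokens, draft_probs, target_probs, rng):
    """Standard speculative-sampling verification of one drafted block.

    draft_tokens: k token ids proposed by the drafter.
    draft_probs:  k distributions (lists over the vocab) that the draft
                  tokens were sampled from.
    target_probs: k + 1 verifier distributions, one per draft position
                  plus one for the token after the last draft token.
    Returns the accepted prefix followed by exactly one verifier token.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        # Accept the draft token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # On rejection, resample from the residual max(p - q, 0) and stop.
        residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
        if sum(residual) == 0.0:  # defensive fallback; only possible if p == q
            residual = p
        out.append(sample(residual, rng))
        return out
    # Every draft token was accepted: take one extra token from the verifier.
    out.append(sample(target_probs[len(draft_tokens)], rng))
    return out
```

Under this rule the first emitted token at each position is distributed exactly according to the verifier, which is why the procedure is described as lossless. The better the draft distributions match the verifier's, the more often the acceptance test passes and the more tokens each verification step yields, which is the motivation for aligning the drafter with the verifier.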