Autoregressive large language models (LLMs) deliver strong performance but require inherently sequential decoding, leading to high inference latency and poor GPU utilization. Speculative decoding mitigates this bottleneck by using a fast draft model whose outputs are verified in parallel by the target LLM; however, existing methods still rely on autoregressive drafting, which remains sequential and limits practical speedups. Diffusion LLMs offer a promising alternative by enabling parallel generation, but current diffusion models typically underperform compared with autoregressive models. In this paper, we introduce DFlash, a speculative decoding framework that employs a lightweight block diffusion model for parallel drafting. By generating draft tokens in a single forward pass and conditioning the draft model on context features extracted from the target model, DFlash enables efficient drafting with high-quality outputs and higher acceptance rates. Experiments show that DFlash achieves over 6x lossless acceleration across a range of models and tasks, delivering up to 2.5x higher speedup than the state-of-the-art speculative decoding method EAGLE-3.
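The draft-then-verify loop described above can be illustrated with a minimal sketch. This is a toy model, not DFlash itself: `target_next` stands in for the target LLM's greedy next-token choice, `draft_block` stands in for the parallel (diffusion-style) drafter that proposes a whole block in one pass, and the verifier accepts the longest prefix of the draft that matches the target's own greedy tokens, so the final output is identical to pure target decoding (the "lossless" property).

```python
import random

random.seed(0)
VOCAB = list(range(8))

def target_next(prefix):
    # Toy deterministic "target model": next token = sum of prefix mod vocab size.
    return sum(prefix) % len(VOCAB)

def draft_block(prefix, k):
    # Toy "parallel drafter": proposes k tokens at once. It imitates the target
    # but occasionally errs, standing in for a lightweight block diffusion draft.
    out, ctx = [], list(prefix)
    for _ in range(k):
        tok = target_next(ctx)
        if random.random() < 0.2:  # injected draft error
            tok = (tok + 1) % len(VOCAB)
        out.append(tok)
        ctx.append(tok)
    return out

def speculative_step(prefix, k=4):
    """One draft-then-verify step of greedy (lossless) speculative decoding.

    The target "verifies in parallel": in a real system one forward pass scores
    all k draft positions at once; here we emulate that by checking each draft
    token against the target's greedy choice, accepting the longest matching
    prefix, then emitting one corrected token, so every target pass yields at
    least one token.
    """
    draft = draft_block(prefix, k)
    accepted, ctx = [], list(prefix)
    for tok in draft:
        if target_next(ctx) == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            break
    # Correction/bonus token from the target at the first mismatch (or block end).
    correction = target_next(ctx)
    return accepted + [correction]

prefix = [1, 2, 3]
for _ in range(3):
    prefix += speculative_step(prefix)
print(prefix)
```

The speedup comes from the accepted-prefix length: when the drafter matches the target well (a high acceptance rate), each target forward pass commits several tokens instead of one, which is exactly what conditioning the draft model on the target's context features is meant to improve.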