Diffusion LLMs (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs (AR-LLMs), with the potential to operate at significantly higher token-generation rates. To unlock this potential, we present Spiffy, a speculative decoding algorithm that accelerates dLLM inference while provably preserving the model's output distribution. This work addresses the unique challenges of applying ideas from speculative decoding of AR-LLMs to dLLMs. Spiffy performs auto-speculation, which eliminates the overhead of an independent draft model, and structures draft states as a novel directed draft graph that takes advantage of the bidirectional, blockwise nature of dLLM generation. These draft graphs are calibrated offline to maximize acceptance rates and are dynamically pruned during inference for improved computational efficiency. We present a detailed formulation of Spiffy and demonstrate that it accelerates LLaDA, Dream, and SDAR models. In combination with KV caching and threshold-based dynamic unmasking, it achieves up to an $8.6\times$ reduction in model inferences and a $6.3\times$ acceleration in token rate.
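As a point of reference for the threshold-based dynamic unmasking mentioned above, the following is a minimal sketch of one denoising step over a block, written as a generic illustration of the technique rather than Spiffy's implementation. The names `MASK`, `unmask_step`, and `predictions`, and the default threshold of 0.9, are hypothetical. The fallback that unmasks the single most confident position when nothing clears the threshold is an assumption made here so that every step makes progress.

```python
from typing import Dict, List, Optional, Tuple

MASK = None  # hypothetical sentinel marking a still-masked position


def unmask_step(
    block: List[Optional[str]],
    predictions: Dict[int, Tuple[str, float]],
    threshold: float = 0.9,
) -> List[Optional[str]]:
    """Apply one threshold-based unmasking step to a block.

    `predictions` maps each masked position to the (token, confidence)
    pair proposed for it. In a real dLLM these would come from a forward
    pass; here they are supplied by the caller (an assumption).
    """
    masked = [i for i, tok in enumerate(block) if tok is MASK]
    if not masked:
        return list(block)  # nothing left to unmask

    # Unmask every position whose confidence clears the threshold.
    chosen = [i for i in masked if predictions[i][1] >= threshold]

    # Assumed fallback: if no position qualifies, unmask the single most
    # confident one so that the step always makes progress.
    if not chosen:
        chosen = [max(masked, key=lambda i: predictions[i][1])]

    out = list(block)
    for i in chosen:
        out[i] = predictions[i][0]
    return out


example_block = ["The", MASK, MASK, MASK]
example_preds = {1: ("cat", 0.95), 2: ("sat", 0.50), 3: ("down", 0.92)}
result = unmask_step(example_block, example_preds, threshold=0.9)
```

In this example, positions 1 and 3 clear the threshold and are unmasked in the same step, while position 2 stays masked for a later step. Unmasking several tokens per step is what lets such a scheme use fewer model inferences than one-token-per-step decoding.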