Speculative decoding accelerates autoregressive large language model inference by drafting multiple tokens and verifying them in a single forward pass of the target model. Recent diffusion-based drafters generate an entire block of tokens in parallel but usually commit to a single draft sequence per verification: once the first mismatch occurs, all subsequent draft tokens are discarded, which limits the acceptance rate. Naively batching more candidate draft sequences yields only marginal improvement, because redundant or poorly placed branches increase the cost of drafting and verification without proportionally increasing the number of accepted tokens. We propose D^2SD, a dual diffusion draft speculative decoding framework that organizes candidates into a confidence-guided prefix tree. The first diffusion drafter generates a block along with per-position confidence scores, which are used to identify the most likely rejection boundary and select the top-K prefix ranges for recovery. The second, variable-prefix diffusion drafter re-anchors at each selected prefix and proposes alternative continuations in one batched pass. The resulting shared-prefix candidates are jointly verified via cascade attention. Empirically, D^2SD shows clear improvements over both the underlying diffusion approach and strong autoregressive speculative decoding baselines.
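To make the confidence-guided prefix selection concrete, the following is a minimal sketch of one way per-position confidence scores could be turned into top-K re-anchoring points. The abstract does not specify how the rejection boundary is scored, so everything here is an assumption: each confidence is treated as an independent acceptance probability, the chance that the first rejection falls at position i is taken to be the product of the earlier confidences times (1 - c_i), and the function names `first_rejection_probs` and `select_prefix_lengths` are hypothetical.

```python
from typing import List


def first_rejection_probs(confidences: List[float]) -> List[float]:
    """Estimate P(first rejection at position i) for each draft position.

    Assumption (not from the paper): confidences act as independent
    per-token acceptance probabilities.
    """
    probs = []
    survive = 1.0  # probability that every earlier draft token was accepted
    for c in confidences:
        probs.append(survive * (1.0 - c))
        survive *= c
    return probs


def select_prefix_lengths(confidences: List[float], k: int) -> List[int]:
    """Return up to k prefix lengths to re-anchor at, in increasing order.

    A prefix length of i keeps draft tokens 0..i-1 and asks the second
    drafter for an alternative continuation starting at position i.
    """
    probs = first_rejection_probs(confidences)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return sorted(ranked[:k])


# Illustrative confidences for a 4-token draft block (made-up values).
confidences = [0.9, 0.5, 0.8, 0.5]
print(select_prefix_lengths(confidences, k=2))  # prints [1, 3]
```

In this toy block the low-confidence positions 1 and 3 are the most likely first-rejection points, so the sketch would re-anchor after prefixes of length 1 and 3. Because every alternative continuation starts from a prefix of the original draft, the candidates form a prefix tree whose shared tokens need to be verified only once.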