Speculative decoding is an effective and lossless approach for accelerating LLM inference. However, widely adopted model-based draft designs such as EAGLE3 improve accuracy at the cost of multi-step autoregressive inference, resulting in high drafting latency and ultimately making the drafting stage itself a performance bottleneck. Inspired by diffusion-based large language models (dLLMs), we propose DART, which leverages parallel generation to reduce drafting latency. DART predicts logits for multiple future masked positions in parallel within a single forward pass, conditioned on the target model's hidden states, thereby eliminating autoregressive rollouts in the draft model while preserving a lightweight design. Based on these parallel logit predictions, we further introduce an efficient tree pruning algorithm that constructs high-quality draft token trees with N-gram-enforced semantic continuity. DART substantially reduces draft-stage overhead while preserving high draft accuracy, leading to significantly improved end-to-end decoding speed. Experimental results demonstrate that DART achieves a 2.03x--3.44x wall-clock speedup across multiple datasets, surpassing EAGLE3 by 30% on average and offering a practical speculative decoding framework. Code is released at https://github.com/fvliang/DART.
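To make the core idea concrete, here is a minimal toy sketch (not the DART implementation; all names, shapes, and the per-position linear heads are illustrative assumptions) of predicting logits for several future masked positions in one forward pass from a shared hidden state, rather than running K autoregressive draft steps:

```python
import numpy as np

# Toy sketch of single-pass parallel drafting (illustrative, not DART's code).
rng = np.random.default_rng(0)
HIDDEN, VOCAB, K = 16, 32, 4  # hidden size, vocab size, number of masked positions

# One hidden state from the target model at the last accepted token (assumed input).
h = rng.standard_normal(HIDDEN)

# One projection head per future masked position (a stand-in for a learned draft head).
heads = rng.standard_normal((K, VOCAB, HIDDEN)) * 0.1

# Single "forward pass": logits for all K future positions at once, no rollout.
logits = heads @ h                    # shape (K, VOCAB)
draft_tokens = logits.argmax(axis=1)  # greedy top-1 token per masked position

print(logits.shape)
print(draft_tokens.shape)
```

An autoregressive draft model would instead need K sequential forward passes, each waiting on the previous token; collapsing them into one matrix product is what removes the drafting-latency bottleneck the abstract describes.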