Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding

Speculative decoding (SD) accelerates large language model inference through a draft-then-verify paradigm. To maximize the acceptance rate, recent methods construct expansive draft trees, but these incur severe GPU memory-bandwidth and computational overheads that bottleneck end-to-end speedup. Dynamic-depth pruning reduces this latency by removing marginal branches, yet it also discards potentially valid candidates, which keeps the acceptance rate below the upper bound set by dense trees. In this paper, we identify a critical opportunity in resource allocation: the transition from dense to pruned drafting frees up a significant computational budget. To break this Pareto tradeoff between drafting cost and acceptance, we introduce Graft, a compensation framework that couples pruning and retrieval as mutually reinforcing operations. Pruning frees the budget that retrieval needs, while retrieval compensates for the coverage lost to pruning and recovers the accepted length. Using a sequential `prune-then-graft' mechanism, Graft attaches highly predictive retrieved tokens at the positions opened by pruning, filling the topological gaps with near-zero overhead. Graft is entirely training-free and lossless. Comprehensive evaluations show that Graft establishes a new Pareto frontier across practical deployment settings, including short-context generation, long-context generation, and large-scale models. On short-context benchmarks, it achieves up to 5.41$\times$ speedup and improves average speedup over EAGLE-3 by up to 21.8% on the large-scale Qwen3-235B. We also provide a preliminary exploration of applying Graft to the DFlash-style block drafting paradigm, offering initial evidence and insights for extending grafting beyond autoregressive draft trees.
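To make the prune-then-graft idea concrete, the following is a minimal toy sketch, not the authors' implementation. The abstract does not specify the pruning criterion, the retrieval source, or the attachment rule, so everything here is an assumption made for illustration: a fixed score threshold stands in for dynamic-depth pruning, an n-gram lookup over the already generated context stands in for retrieval, and the names `Node`, `prune`, `retrieve`, `graft`, and `prune_then_graft` are hypothetical. Verification by the target model, which is what keeps speculative decoding lossless, is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    token: str
    score: float                  # assumed: cumulative draft probability of the path
    children: list = field(default_factory=list)
    grafted: bool = False         # True if the node came from retrieval, not the drafter


def nodes(root):
    """Return all nodes below root (root excluded) in depth-first order."""
    out, stack = [], list(reversed(root.children))
    while stack:
        n = stack.pop()
        out.append(n)
        stack.extend(reversed(n.children))
    return out


def prune(root, threshold):
    """Drop every subtree whose top node scores below threshold.

    Returns the number of nodes removed, i.e. the budget freed.
    """
    removed, kept = 0, []
    for child in root.children:
        if child.score < threshold:
            removed += 1 + len(nodes(child))
        else:
            removed += prune(child, threshold)
            kept.append(child)
    root.children = kept
    return removed


def retrieve(context, k):
    """Toy retrieval: find the most recent earlier occurrence of the last
    context token and return up to k tokens that followed it."""
    last = context[-1]
    for i in range(len(context) - 2, -1, -1):
        if context[i] == last:
            return context[i + 1 : i + 1 + k]
    return []


def graft(root, tokens, budget):
    """Attach tokens as a chain under root, reusing nodes that already match.

    Creates at most `budget` new nodes and returns how many were created.
    """
    node, added = root, 0
    for tok in tokens:
        match = next((c for c in node.children if c.token == tok), None)
        if match is None:
            if added == budget:
                break
            match = Node(tok, score=0.0, grafted=True)
            node.children.append(match)
            added += 1
        node = match
    return added


def prune_then_graft(root, context, threshold):
    """Prune first, then spend only the freed budget on retrieved tokens."""
    freed = prune(root, threshold)
    tokens = retrieve(context, k=freed)
    added = graft(root, tokens, budget=freed)
    return freed, added


if __name__ == "__main__":
    # The root stands for the last accepted token; its subtree is the draft.
    root = Node("the", 1.0, [
        Node("cat", 0.6, [Node("ran", 0.3), Node("sat", 0.05)]),
        Node("dog", 0.08, [Node("barked", 0.04)]),
    ])
    context = ["the", "cat", "sat", "on", "the"]
    freed, added = prune_then_graft(root, context, threshold=0.1)
    print(freed, added, [n.token for n in nodes(root)])
    # prints: 3 2 ['cat', 'ran', 'sat', 'on']
```

In this toy run, pruning removes three low-score nodes and grafting spends two of those freed slots on the retrieved continuation, so the final tree is no larger than the original dense one. Capping the grafted nodes at the freed budget is a design choice of the sketch that mirrors the abstract's claim that pruning pays for retrieval.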
