Speculative decoding accelerates generation by verifying multiple drafted tokens in a single target-model forward pass, reducing sequential decoding iterations. Model-free variants avoid auxiliary draft models by reusing text and model states already available during generation, but their speedup depends on the reliability of the constructed drafts. We identify two limitations of existing reuse-based methods: lexically anchored retrieval has limited recall under surface-form variation, and deterministic span copying can be brittle when the retrieved context does not uniquely determine the continuation. We propose \emph{AdaPLD}, a training-free method that adaptively improves both retrieval and draft construction. AdaPLD preserves high-precision lexical reuse while using semantic similarity to recover additional reuse opportunities when lexical matching fails. It further constructs branched reuse hypotheses to account for continuation uncertainty, rather than relying on a single copied span. Across diverse benchmarks, AdaPLD reduces target-model forward passes and achieves up to $3.10\times$ decoding speedup.
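To make the reuse-based setting concrete, the following is a minimal sketch of the lexically anchored, single-span baseline that the abstract identifies as limited; it is not AdaPLD itself. The function names `lookup_draft` and `verify_greedy`, the parameters `ngram` and `span`, and the toy `next_token` callback are all hypothetical and introduced only for illustration. A real system would score the whole draft in one batched target-model forward pass; here that pass is simulated with sequential calls under greedy decoding.

```python
from typing import Callable, List, Sequence


def lookup_draft(tokens: Sequence[int], ngram: int = 2, span: int = 4) -> List[int]:
    """Copy the span that followed the most recent earlier match of the last `ngram` tokens.

    Illustrative only: exact lexical matching, one deterministic copied span.
    """
    if len(tokens) < ngram + 1:
        return []
    suffix = list(tokens[-ngram:])
    # Scan right to left over earlier positions, skipping the suffix itself.
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if list(tokens[start:start + ngram]) == suffix:
            follow = start + ngram
            return list(tokens[follow:follow + span])
    return []  # no lexical match: nothing to draft


def verify_greedy(
    tokens: Sequence[int],
    draft: Sequence[int],
    next_token: Callable[[Sequence[int]], int],
) -> List[int]:
    """Accept the longest draft prefix that agrees with the target's greedy choices.

    Assumption: `next_token` stands in for the target model. The loop simulates
    what would be a single batched verification pass in practice.
    """
    accepted: List[int] = []
    for tok in draft:
        target = next_token(list(tokens) + accepted)
        if target != tok:
            accepted.append(target)  # first mismatch: keep the target's token and stop
            return accepted
        accepted.append(tok)
    # Every drafted token matched, so the target contributes one extra token.
    accepted.append(next_token(list(tokens) + accepted))
    return accepted
```

When the final n-gram has no exact earlier occurrence, `lookup_draft` returns an empty draft and the step falls back to ordinary one-token decoding; this is the recall gap under surface-form variation that the abstract describes. Likewise, the single copied span is accepted only as far as it agrees with the target, which is the brittleness that branched reuse hypotheses are meant to address.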