Speculative decoding (SD) has become a standard technique for accelerating LLM inference without sacrificing output quality. Recent advances in speculative decoding have shifted from sequential chain-based drafting to tree-structured generation, where the draft model constructs a tree of candidate tokens to explore multiple possible drafts in parallel. However, existing tree-based SD methods typically build a fixed-width, fixed-depth draft tree, which fails to adapt to the varying difficulty of tokens and contexts. As a result, the draft model cannot dynamically adjust the tree structure, stopping early on difficult tokens and extending generation for simple ones. To address these challenges, we introduce TALON, a training-free, budget-driven adaptive tree expansion framework that can be plugged into existing tree-based methods. Unlike static methods, TALON constructs the draft tree iteratively until a fixed token budget is exhausted, using a hybrid expansion strategy that adaptively allocates the node budget across the layers of the draft tree. This framework naturally shapes the draft tree into a "deep-and-narrow" form for deterministic contexts and a "shallow-and-wide" form for uncertain branches, effectively optimizing the trade-off between exploration width and generation depth under a given budget. Extensive experiments across 5 models and 6 datasets demonstrate that TALON consistently outperforms the state-of-the-art EAGLE-3, achieving up to 5.16x end-to-end speedup over auto-regressive decoding.
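The budget-driven adaptive expansion described above can be illustrated as a best-first search over draft-tree nodes scored by cumulative draft probability. The sketch below is a minimal illustration of the general idea, not TALON's exact hybrid strategy; `draft_topk` is a hypothetical callable standing in for the draft model, returning candidate next-token probabilities for a given path. Peaked (deterministic) distributions naturally drive the tree deep and narrow, while flat (uncertain) ones keep it shallow and wide:

```python
import heapq
import math

def expand_draft_tree(draft_topk, budget=16, k=4):
    """Best-first draft-tree expansion under a fixed node budget.

    draft_topk(path) -> list of (token, prob) candidates for the next token.
    Returns the list of (path, cumulative_logprob) draft nodes created.
    """
    # Max-heap on cumulative log-prob (negated, since heapq is a min-heap).
    frontier = [(-0.0, ())]  # root: empty path, log-prob 0
    tree = []
    while frontier and len(tree) < budget:
        neg_lp, path = heapq.heappop(frontier)
        # Expand the most promising node; spend budget on its top-k children.
        for tok, p in draft_topk(path)[:k]:
            if len(tree) >= budget:
                break
            child = path + (tok,)
            lp = -neg_lp + math.log(p)
            tree.append((child, lp))
            heapq.heappush(frontier, (-lp, child))
    return tree

# Toy draft models: one peaked (near-deterministic), one flat (uncertain).
peaked = lambda path: [("a", 0.97), ("b", 0.01), ("c", 0.01), ("d", 0.01)]
flat = lambda path: [("a", 0.25), ("b", 0.25), ("c", 0.25), ("d", 0.25)]

deep_tree = expand_draft_tree(peaked, budget=16, k=4)
wide_tree = expand_draft_tree(flat, budget=16, k=4)
```

Under the same 16-node budget, the peaked model yields a deeper tree than the flat one, mirroring the deep-and-narrow versus shallow-and-wide shapes the abstract describes.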