Autoregressive decoding inherently limits the inference throughput of Large Language Models (LLMs) due to its sequential dependency. Speculative decoding mitigates this by verifying multiple predicted tokens in parallel, but its efficiency remains constrained by what we identify as verification heterogeneity -- the uneven difficulty of verifying different speculative candidates. In practice, a small subset of high-confidence predictions accounts for most successful verifications, yet existing methods treat all candidates uniformly, leading to redundant computation. We present HeteroSpec, a heterogeneity-adaptive speculative decoding framework that allocates verification effort in proportion to candidate uncertainty. HeteroSpec estimates verification complexity with a lightweight entropy-based quantifier, partitions candidates via a data-driven stratification policy, and dynamically tunes speculative depth and pruning thresholds through coordinated optimization. Across five benchmarks and four LLMs, HeteroSpec delivers an average 4.24$\times$ decoding speedup over state-of-the-art methods such as EAGLE-3, while preserving exact output distributions. Crucially, HeteroSpec requires no model retraining and remains compatible with other inference optimizations, making it a practical approach to improving speculative decoding efficiency.
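The core idea above -- score each speculative candidate by the entropy of its next-token distribution, then map that score to a stratum that decides how much speculative effort to spend -- can be sketched minimally as follows. The threshold values, the number of strata, and the per-stratum depths here are illustrative assumptions, not the fitted values HeteroSpec learns from data.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def stratify(entropy, thresholds=(0.5, 2.0)):
    """Map an entropy value to a stratum index: 0 = easiest to verify,
    len(thresholds) = hardest. Thresholds are hypothetical placeholders;
    the paper's stratification policy is fitted from data."""
    for i, t in enumerate(thresholds):
        if entropy < t:
            return i
    return len(thresholds)

# Hypothetical policy: speculate deeper on low-entropy (easy) candidates,
# prune aggressively on high-entropy (hard) ones.
DEPTH_BY_STRATUM = {0: 8, 1: 5, 2: 2}

def speculative_depth(probs):
    """Choose a speculative depth from the candidate's distribution."""
    return DEPTH_BY_STRATUM[stratify(token_entropy(probs))]
```

For example, a sharply peaked distribution such as `[0.97, 0.01, 0.01, 0.01]` has near-zero entropy and lands in the easiest stratum, so the sketch assigns it the deepest speculation, while a uniform distribution over many tokens is routed to the shallow, heavily pruned stratum.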