In Speculative Decoding, verification is a key bottleneck in improving inference speed while maintaining distribution fidelity. Recent work has shown that sequence-level verification accepts more tokens than token-wise verification. However, existing solutions often rely on surrogate approximations or are constrained by partial information, and they struggle with joint intractability. In this work, we propose Hierarchical Speculative Decoding (HSD), a provably lossless verification method that significantly increases the expected number of accepted tokens and overcomes joint intractability by balancing excess and deficient probability mass across accessible branches. Large-scale experiments show that HSD consistently improves acceptance rates across diverse model families and benchmarks. Moreover, its strong explainability and generality make it easy to integrate into a wide range of speculative decoding frameworks. Notably, integrating HSD into EAGLE-3 yields a performance gain of over 12%, establishing state-of-the-art decoding efficiency without compromising distribution fidelity. Code is available at https://github.com/ZhouYuxuanYX/Hierarchical-Speculative-Decoding.
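For context, the token-wise verification baseline that the abstract contrasts with sequence-level verification can be sketched as follows. This is the standard speculative sampling acceptance rule (accept a drafted token x with probability min(1, p(x)/q(x)), otherwise resample from the renormalized residual max(0, p - q)), not the HSD procedure itself. The function name `tokenwise_verify` and the array layout are illustrative assumptions, not taken from the authors' code.

```python
import numpy as np


def tokenwise_verify(draft_tokens, q_probs, p_probs, rng):
    """Baseline token-wise verification (illustrative; NOT the HSD method).

    draft_tokens: K token ids proposed by the draft model.
    q_probs: array of shape (K, V), draft distribution at each position.
    p_probs: array of shape (K + 1, V), target distribution at each
             position, plus one extra row used for the bonus token.
    Returns the accepted prefix followed by exactly one extra token.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_probs[i], q_probs[i]
        # Accept the drafted token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(int(tok))
            continue
        # Rejected: resample from the residual max(0, p - q), renormalized,
        # then stop, since later drafted tokens depend on the rejected one.
        residual = np.maximum(p - q, 0.0)
        residual /= residual.sum()
        out.append(int(rng.choice(len(residual), p=residual)))
        return out
    # Every drafted token was accepted: sample a bonus token from the target.
    last = p_probs[len(draft_tokens)]
    out.append(int(rng.choice(len(last), p=last)))
    return out
```

Because each position is judged on its own, a single rejection discards every later drafted token. That is the limitation sequence-level schemes such as the one proposed here aim to reduce.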