Inference efficiency in Large Language Models (LLMs) is fundamentally limited by serial, autoregressive generation, especially as reasoning becomes a key capability and response sequences grow longer. Speculative decoding (SD) offers a powerful remedy, delivering significant speed-ups through lightweight drafting and parallel verification. While existing work has nearly saturated improvements in draft effectiveness and efficiency, this paper advances SD from a new but critical angle: verification cost. We propose TriSpec, a novel ternary SD framework whose core is a lightweight proxy that significantly reduces computation by approving easily verifiable draft sequences and engaging the full target model only when uncertain tokens are encountered. TriSpec can be integrated with state-of-the-art SD methods such as EAGLE-3 to further reduce verification cost and achieve greater acceleration. Extensive experiments on the Qwen3 and DeepSeek-R1-Distill-Qwen/LLaMA families show that TriSpec achieves up to 35\% speedup over standard SD and up to 50\% fewer target-model invocations while maintaining comparable accuracy.
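The gating idea described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names (`ternary_verify`, `target_verify`), the confidence threshold `tau`, and the use of a scalar proxy-confidence score per draft token are all assumptions made for the sketch.

```python
def ternary_verify(draft_tokens, proxy_confidence, target_verify, tau=0.9):
    """Illustrative ternary verification (assumed interface, not TriSpec's API):
    accept draft tokens while a cheap proxy is confident, and fall back to the
    full target model only at the first uncertain position.

    draft_tokens     : list of draft token ids
    proxy_confidence : per-token confidence scores from the lightweight proxy
    target_verify    : callable running full target-model verification on a
                       token suffix, returning the accepted tokens
    tau              : proxy-confidence threshold (hypothetical value)
    """
    accepted = []
    for i, (tok, conf) in enumerate(zip(draft_tokens, proxy_confidence)):
        if conf >= tau:
            # Easy case: proxy approves the token without the target model.
            accepted.append(tok)
        else:
            # Uncertain token: engage the target model on the remaining suffix.
            accepted.extend(target_verify(draft_tokens[i:]))
            break
    return accepted


# Toy usage: the target model accepts only the first token of any suffix.
toy_target = lambda suffix: suffix[:1]
out = ternary_verify([11, 12, 13, 14], [0.95, 0.92, 0.50, 0.99], toy_target)
```

In this toy run the proxy approves the first two tokens cheaply, the third token's low confidence triggers a single target-model call, and that call accepts one more token, so `out` is `[11, 12, 13]`; saving target invocations on the confident prefix is what drives the reported reduction in verification cost.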