To mitigate the high inference latency stemming from autoregressive decoding in Large Language Models (LLMs), Speculative Decoding has emerged as a novel decoding paradigm for LLM inference. In each decoding step, this method first drafts several future tokens efficiently and then verifies them in parallel. Unlike autoregressive decoding, Speculative Decoding enables multiple tokens to be decoded per step, thereby accelerating inference. This paper presents a comprehensive overview and analysis of this promising decoding paradigm. We begin by providing a formal definition and formulation of Speculative Decoding. We then discuss its key facets in depth, including drafter selection and verification strategies. Furthermore, we present a comparative analysis of leading methods in third-party testing environments. We hope this work serves as a catalyst for further research on Speculative Decoding, ultimately contributing to more efficient LLM inference.
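The draft-then-verify loop described in the abstract can be sketched with toy stand-in models. The `draft_next` and `target_next` functions below are hypothetical placeholders for a cheap draft model and the large target LLM; in a real system, verification of all drafted tokens happens in a single parallel forward pass of the target model, which is where the speedup comes from. This is a minimal sketch under those assumptions, not any specific published algorithm:

```python
def target_next(ctx):
    """Toy 'target LLM': deterministically predicts (last token + 1) mod 10."""
    return (ctx[-1] + 1) % 10

def draft_next(ctx):
    """Toy 'draft model': usually agrees with the target, but errs after token 3."""
    return 9 if ctx[-1] == 3 else (ctx[-1] + 1) % 10

def speculative_decode(prompt, num_tokens, k=4):
    """Generate num_tokens tokens; each step drafts k tokens, then verifies them."""
    seq = list(prompt)
    steps = 0
    while len(seq) < len(prompt) + num_tokens:
        steps += 1
        # 1) Draft: cheaply propose k future tokens autoregressively.
        draft, ctx = [], seq[:]
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) Verify: check each drafted token against the target model's
        #    prediction (a real system scores all k positions in parallel).
        accepted, ctx = [], seq[:]
        for t in draft:
            correct = target_next(ctx)
            if t == correct:
                accepted.append(t)
                ctx.append(t)
            else:
                # 3) On the first mismatch, keep the target's own token
                #    and discard the rest of the draft.
                accepted.append(correct)
                ctx.append(correct)
                break
        else:
            # All k drafted tokens accepted: the verification pass also
            # yields one extra target token for free.
            accepted.append(target_next(ctx))
        seq.extend(accepted)
    return seq[len(prompt):len(prompt) + num_tokens], steps

tokens, steps = speculative_decode([0], num_tokens=12, k=4)
print(tokens, steps)  # 12 tokens in far fewer than 12 decoding steps
```

Because several tokens are accepted per verification step, the number of (expensive) target-model steps is much smaller than the 12 steps plain autoregressive decoding would need, while the output is identical to what the target model alone would produce.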