To mitigate the high inference latency stemming from autoregressive decoding in Large Language Models (LLMs), Speculative Decoding has emerged as a novel decoding paradigm for LLM inference. In each decoding step, this method first drafts several future tokens efficiently and then verifies them in parallel. Unlike autoregressive decoding, Speculative Decoding facilitates the simultaneous decoding of multiple tokens per step, thereby accelerating inference. This paper presents a comprehensive overview and analysis of this promising decoding paradigm. We begin by providing a formal definition and formulation of Speculative Decoding. Then, we organize in-depth discussions on its key facets, such as drafter selection and verification strategies. Furthermore, we present a comparative analysis of leading methods under third-party testing environments. We aim for this work to serve as a catalyst for further research on Speculative Decoding, ultimately contributing to more efficient LLM inference.
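The draft-then-verify step described in the abstract can be illustrated with a minimal sketch. This is not any specific method from the survey: it assumes greedy decoding and an exact-match verification rule, and it replaces both models with toy deterministic functions. The names `target_next`, `draft_next`, `speculative_decode`, and the parameter `k` are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]

VOCAB_SIZE = 50  # hypothetical toy vocabulary size


def target_next(context: List[Token]) -> Token:
    """Toy stand-in for the target LLM's greedy next-token choice (assumption)."""
    return (31 * sum(context) + 7 * len(context)) % VOCAB_SIZE


def draft_next(context: List[Token]) -> Token:
    """Toy stand-in for a cheaper drafter (assumption).

    It agrees with the target except when the context length is a multiple of 4.
    """
    guess = target_next(context)
    return (guess + 1) % VOCAB_SIZE if len(context) % 4 == 0 else guess


def autoregressive_decode(prompt: List[Token], next_token: NextTokenFn,
                          num_new: int) -> List[Token]:
    """Baseline: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(next_token(tokens))
    return tokens


def speculative_decode(prompt: List[Token], draft: NextTokenFn,
                       target: NextTokenFn, num_new: int,
                       k: int = 4) -> Tuple[List[Token], int]:
    """Greedy draft-then-verify loop; returns the tokens and the number of steps."""
    tokens = list(prompt)
    limit = len(prompt) + num_new
    steps = 0
    while len(tokens) < limit:
        steps += 1
        # 1) Draft: the cheap model proposes k future tokens one by one.
        drafted: List[Token] = []
        for _ in range(k):
            drafted.append(draft(tokens + drafted))
        # 2) Verify: the target's choice at each of the k + 1 positions.
        #    A real system would obtain these from one batched forward pass;
        #    this sketch simulates that with a list comprehension.
        verified = [target(tokens + drafted[:i]) for i in range(k + 1)]
        # 3) Accept the longest drafted prefix that matches the target,
        #    then append the target's own token at the next position.
        n_accept = 0
        while n_accept < k and drafted[n_accept] == verified[n_accept]:
            n_accept += 1
        tokens.extend(drafted[:n_accept])
        tokens.append(verified[n_accept])
    return tokens[:limit], steps


if __name__ == "__main__":
    prompt = [1, 2, 3]
    baseline = autoregressive_decode(prompt, target_next, 20)
    output, steps = speculative_decode(prompt, draft_next, target_next, 20)
    print("same output as target-only decoding:", output == baseline)
    print("verification steps used:", steps)
```

Under this exact-match rule, the output is identical to what the target model would produce on its own, and every step emits at least one token, so the number of verification steps never exceeds the number of generated tokens. The actual speedup depends on how often the drafter agrees with the target and on how cheap drafting is relative to a target forward pass.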