To mitigate the high inference latency stemming from autoregressive decoding in Large Language Models (LLMs), Speculative Decoding has emerged as a novel decoding paradigm for LLM inference. In each decoding step, this method first efficiently drafts several future tokens and then verifies them in parallel. Unlike autoregressive decoding, Speculative Decoding enables multiple tokens to be decoded per step, thereby accelerating inference. This paper presents a comprehensive overview and analysis of this promising decoding paradigm. We begin with a formal definition and formulation of Speculative Decoding. We then discuss its key facets in depth, including current leading techniques, remaining challenges, and potential future directions in this field. We aim for this work to serve as a catalyst for further research on Speculative Decoding, ultimately contributing to more efficient LLM inference.
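The draft-then-verify loop described above can be illustrated with a minimal sketch. This is not any specific system's implementation: the two callables `target_next` and `draft_next` are hypothetical stand-ins for a large target model and a cheap draft model (both greedy here), and the "parallel" verification of the drafted positions is simulated sequentially. Sampling-based variants replace the exact-match test with a rejection-sampling acceptance rule.

```python
from typing import Callable, List

def speculative_decode_step(
    target_next: Callable[[List[int]], int],  # greedy next token from the target model
    draft_next: Callable[[List[int]], int],   # greedy next token from the cheap draft model
    prefix: List[int],
    k: int = 4,
) -> List[int]:
    """One speculative step: draft k tokens, then verify them against the
    target model and keep the longest agreeing prefix."""
    # 1) Draft phase: autoregressively propose k tokens with the cheap model.
    draft, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)
    # 2) Verify phase: the target model scores all k drafted positions
    #    (in parallel on real hardware; a loop here). Accept until the
    #    first mismatch, then substitute the target's own token, so each
    #    step always emits at least one token.
    accepted, ctx = [], list(prefix)
    for t in draft:
        target_t = target_next(ctx)
        if target_t == t:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(target_t)
            break
    else:
        # All drafts accepted: emit one bonus token from the target.
        accepted.append(target_next(ctx))
    return accepted

# Toy "models" over integer token ids: the target counts up by 1; the
# draft agrees except after token 3, where it guesses wrongly.
target = lambda ctx: ctx[-1] + 1
drafter = lambda ctx: ctx[-1] + 1 if ctx[-1] != 3 else 99

print(speculative_decode_step(target, drafter, [0], k=4))  # → [1, 2, 3, 4]
```

Three drafted tokens are accepted in a single step here, whereas plain autoregressive decoding would need one target-model pass per token; this per-step acceptance length is what drives the speedup.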