Accelerating text generation in large language models (LLMs) is crucial for producing content efficiently, yet the sequential, token-by-token nature of autoregressive decoding often leads to high inference latency, which poses challenges for real-time applications. Various techniques have been proposed to address this latency and improve efficiency. This paper presents a comprehensive survey of accelerated generation techniques for autoregressive language models, covering state-of-the-art methods and their applications. We categorize these techniques into three key areas: speculative decoding, early exiting mechanisms, and non-autoregressive methods. For each category, we discuss its underlying principles, advantages, limitations, and recent advancements. Through this survey, we aim to offer insights into the current landscape of accelerated generation techniques for LLMs and to provide guidance for future research in this critical area of natural language processing.