While Large Language Models (LLMs) have shown remarkable abilities, they are hindered by significant resource consumption and considerable latency due to autoregressive processing. In this study, we introduce Adaptive N-gram Parallel Decoding (ANPD), an innovative and lossless approach that accelerates inference by allowing the simultaneous generation of multiple tokens. ANPD follows a two-stage approach: a rapid drafting phase that employs an N-gram module, which adapts to the current interactive context, followed by a verification phase, during which the original LLM assesses and confirms the proposed tokens. Consequently, ANPD preserves the integrity of the LLM's original output while enhancing processing speed. We further leverage a multi-level architecture for the N-gram module to improve the precision of the initial draft, thereby reducing inference latency. ANPD requires no retraining and no extra GPU memory, making it an efficient, plug-and-play enhancement. In our experiments, models such as LLaMA and its fine-tuned variants achieved speed-ups of up to 3.67x, validating the effectiveness of the proposed ANPD.
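The draft-and-verify loop described above can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's implementation: it uses a single-level bigram table in place of the multi-level N-gram module, and a stand-in `target_next_fn` callable in place of the LLM's batched next-token predictions.

```python
def build_ngram_table(tokens, n=2):
    """Map each (n-1)-token prefix to the token that most recently
    followed it; the table adapts as the context grows."""
    table = {}
    for i in range(len(tokens) - n + 1):
        prefix = tuple(tokens[i:i + n - 1])
        table[prefix] = tokens[i + n - 1]  # later occurrences overwrite earlier
    return table

def draft_tokens(context, table, n=2, max_draft=4):
    """Drafting phase: greedily extend the context with the N-gram table."""
    draft, cur = [], list(context)
    for _ in range(max_draft):
        nxt = table.get(tuple(cur[-(n - 1):]))
        if nxt is None:
            break
        draft.append(nxt)
        cur.append(nxt)
    return draft

def anpd_step(context, target_next_fn, n=2, max_draft=4):
    """One draft-and-verify iteration. `target_next_fn(context)` stands in
    for the LLM's next-token prediction (illustrative assumption)."""
    table = build_ngram_table(context, n)          # adapts to current context
    draft = draft_tokens(context, table, n, max_draft)
    accepted, cur = [], list(context)
    for tok in draft:
        if target_next_fn(cur) != tok:             # verification by the LLM
            break
        accepted.append(tok)
        cur.append(tok)
    # Always emit the model's own next token, so output stays lossless:
    # every kept token matches what pure autoregressive decoding would produce.
    accepted.append(target_next_fn(cur))
    return accepted

# Toy deterministic "model" that cycles through a repeating pattern,
# so N-gram drafts are often accepted.
pattern = ["a", "b", "c"]
toy_model = lambda ctx: pattern[len(ctx) % 3]

print(anpd_step(["a", "b", "c", "a", "b", "c"], toy_model, n=2, max_draft=3))
# → ['a', 'b', 'c', 'a']  (4 tokens in one step instead of 1)
```

On this repetitive toy input the verifier accepts the whole three-token draft and appends its own next token, yielding four tokens per model step; on a draft mismatch, the loop falls back to a single model-generated token, which is why the output never deviates from the original LLM's.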