Large language models have demonstrated exceptional capability in natural language understanding and generation. However, their generation speed is limited by the inherently sequential nature of their decoding process, posing challenges for real-time applications. This paper introduces Lexical Unit Decoding (LUD), a novel decoding methodology, implemented in a data-driven manner, that accelerates the decoding process without sacrificing output quality. The core of our approach is the observation that a pre-trained language model can confidently predict multiple contiguous tokens, which form a \textit{lexical unit}; the tokens within such a unit can be decoded in parallel. Extensive experiments validate that our method substantially reduces decoding time while maintaining generation quality: a 33\% speed-up on natural language generation with no quality loss, and a 30\% speed-up on code generation with a negligible quality loss of 3\%. Distinctively, LUD requires no auxiliary models and no changes to existing architectures. It can also be integrated with other decoding acceleration methods, achieving an even more pronounced inference efficiency boost. We posit that the foundational principles of LUD could define a new decoding paradigm for future language models, enhancing their applicability across a broader spectrum of applications. All code is publicly available at https://github.com/tjunlp-lab/Lexical-Unit-Decoding-LUD-. Keywords: Parallel Decoding, Lexical Unit Decoding, Large Language Model
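The core observation above can be sketched as a confidence-gated acceptance loop: from a set of contiguous tokens predicted in one step, accept the longest prefix whose per-token confidence clears a threshold and emit it as one lexical unit. This is a minimal illustrative sketch, not the paper's implementation; the function name and the threshold value are assumptions.

```python
# Hypothetical sketch of confidence-based parallel acceptance, the core
# idea behind lexical unit decoding. The threshold (0.9) and the helper
# name are illustrative assumptions, not values from the paper.

def accept_lexical_unit(token_probs, threshold=0.9):
    """Given the model's confidence for each of several contiguous draft
    tokens predicted in one forward pass, return the length of the
    longest prefix whose per-token confidence stays above the threshold.
    Those tokens form one 'lexical unit' and are decoded in parallel;
    the remaining positions fall back to ordinary one-token decoding."""
    accepted = 0
    for prob in token_probs:
        if prob < threshold:
            break  # confidence drops: the lexical unit ends here
        accepted += 1
    return accepted

# Example: the first three draft tokens are confident, the fourth is not,
# so three tokens are emitted in this step instead of one.
print(accept_lexical_unit([0.98, 0.95, 0.93, 0.40]))  # → 3
```

When every step accepts more than one token on average, total decoding steps shrink proportionally, which is the source of the reported speed-up.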