Code generation is a latency-sensitive task that demands high timeliness, but the autoregressive decoding mechanism of Large Language Models (LLMs) leads to poor inference efficiency. Existing LLM inference acceleration methods mainly focus on standalone functions that use only built-in components. Moreover, they treat code as natural language sequences, ignoring its unique syntactic and semantic characteristics. As a result, the effectiveness of these approaches in code generation tasks remains limited and fails to align with real-world programming scenarios. To alleviate this issue, we propose CodeSwift, a simple yet highly efficient inference acceleration approach specifically designed for code generation that does not compromise output quality. CodeSwift constructs a multi-source datastore, providing access to both general and project-specific knowledge and facilitating the retrieval of high-quality draft sequences. Moreover, CodeSwift reduces retrieval cost by controlling retrieval timing, and enhances efficiency through parallel retrieval and a context- and LLM-preference-aware cache. Experimental results show that CodeSwift achieves speedups of up to 2.53x and 2.54x over autoregressive decoding on repository-level and standalone code generation tasks, respectively, outperforming state-of-the-art inference acceleration approaches by up to 88%.
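The core idea the abstract describes, retrieving draft sequences from a datastore and having the LLM verify them instead of decoding token by token, can be sketched as follows. This is a minimal illustration under simplifying assumptions: the datastore here is a single n-gram index (not CodeSwift's multi-source design), the names `build_datastore`, `retrieve_draft`, and `verify` are hypothetical, and a deterministic stand-in plays the role of the target model's greedy next-token choice.

```python
from collections import defaultdict

def build_datastore(corpus_tokens, ngram=2, draft_len=4):
    """Index the corpus: map each n-gram to the continuation that followed it."""
    store = defaultdict(list)
    for i in range(len(corpus_tokens) - ngram):
        key = tuple(corpus_tokens[i:i + ngram])
        store[key].append(corpus_tokens[i + ngram:i + ngram + draft_len])
    return store

def retrieve_draft(store, context, ngram=2):
    """Look up a draft continuation for the current context suffix."""
    drafts = store.get(tuple(context[-ngram:]))
    return drafts[0] if drafts else []

def verify(draft, target_next_token, context):
    """Accept draft tokens one by one while they match the target model's
    greedy choice; all accepted tokens cost a single verification pass."""
    accepted, ctx = [], list(context)
    for tok in draft:
        if target_next_token(ctx) != tok:
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted

# Toy usage: the "target model" deterministically continues a known corpus.
corpus = ["for", "i", "in", "range", "(", "n", ")", ":"]
store = build_datastore(corpus)
context = ["for", "i"]
draft = retrieve_draft(store, context)          # ["in", "range", "(", "n"]
accepted = verify(draft, lambda ctx: corpus[len(ctx)], context)
```

When a retrieved draft matches the model's own preferences, several tokens are accepted per verification step, which is the source of the speedup; when the draft diverges, decoding falls back to one token at a time, so output quality is unchanged.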