Code generation aims to automatically generate code snippets that meet given natural language requirements and plays an important role in software development. Although Code LLMs have shown excellent performance in this domain, their long generation time poses a significant limitation in practical use. In this paper, we first conduct an in-depth preliminary study with different Code LLMs on code generation tasks and identify a significant efficiency issue, i.e., the continual generation of excess tokens. This issue harms developer productivity and leads to substantial computational waste. To address it, we introduce CodeFast, an inference acceleration approach for Code LLMs on code generation. The key idea of CodeFast is to terminate the inference process in time when unnecessary excess tokens are detected. First, we propose an automatic data construction framework to obtain training data. Then, we train GenGuard, a unified lightweight model applicable to multiple programming languages, to predict whether to terminate inference at the current step. Finally, we enhance Code LLMs with GenGuard to accelerate their inference on code generation tasks. We conduct extensive experiments with CodeFast on five representative Code LLMs across four widely used code generation datasets. Experimental results show that (1) CodeFast can significantly improve the inference speed of various Code LLMs in code generation, ranging from 34% to 452%, without compromising the quality of the generated code, and (2) CodeFast is stable across different parameter settings and can generalize to unseen datasets. Our code and data are available at https://github.com/DeepSoftwareAnalytics/CodeFast
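The core mechanism, terminating decoding as soon as a lightweight classifier flags excess tokens, can be illustrated with a minimal toy sketch. Everything here is hypothetical: `mock_next_token` stands in for a Code LLM decoding step, and `genguard_should_stop` is a hand-written heuristic standing in for the trained GenGuard classifier, which in the actual approach predicts termination from the model's generation context.

```python
# Toy sketch of GenGuard-style early termination during generation.
# All names and logic are illustrative stand-ins, not the paper's implementation.

def mock_next_token(step: int) -> str:
    """Stand-in for one Code LLM decoding step (line-level for readability)."""
    output = ["def add(a, b):", "    return a + b", "", "# excess", "# excess"]
    return output[step] if step < len(output) else "# excess"

def genguard_should_stop(lines: list[str]) -> bool:
    """Heuristic stand-in for GenGuard: stop once the function body is
    complete (a blank line after at least one body line). The real GenGuard
    is a trained lightweight model making this decision at each step."""
    return len(lines) >= 2 and lines[-1].strip() == ""

def generate_with_early_stop(max_steps: int = 10) -> str:
    lines = []
    for step in range(max_steps):
        lines.append(mock_next_token(step))
        if genguard_should_stop(lines):
            break  # terminate inference instead of emitting excess tokens
    return "\n".join(lines).rstrip()

print(generate_with_early_stop())
```

The savings come from the `break`: without the termination check, the loop would continue emitting `# excess` lines until `max_steps`, which mirrors the excess-token issue the abstract describes.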