Code generation leverages artificial intelligence technologies, particularly Large Language Models (LLMs), to automatically produce source code, enhancing software development efficiency and reducing repetitive tasks. However, code generated by LLMs often fails to pass test cases and requires substantial human effort to fix. Previous studies focused on crafting better prompts or improving the models' capabilities, but ignored why LLMs fail. In this paper, we first reproduced the results of 14 LLMs, including GPT-3.5-turbo and 13 open-source LLMs, on the HumanEval dataset. We extracted 12,837 code generation errors and conducted an in-depth analysis of their causes, identifying 19 distinct error causes. Our empirical analysis indicated that three of these causes can be fixed directly. Consequently, we proposed a fixing method called LlmFix, which addresses these three error types through a three-step process: filtering code for indentation correction, truncating redundant generated code, and importing missing modules. Experimental results demonstrate that LlmFix effectively repairs these three error types, significantly improving the performance of the 14 LLMs on the HumanEval and MBPP datasets, with average increases of 9.5% and 5.4%, respectively.
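The three-step process named in the abstract can be sketched as a minimal post-processor for generated Python. This is an illustrative assumption, not the paper's actual LlmFix implementation: the step names follow the abstract, while the concrete heuristics and the `fix_generated_code` name are hypothetical.

```python
import ast

# Standard-library modules we are willing to auto-import (illustrative set).
STDLIB = {"math", "re", "itertools", "collections", "heapq"}


def fix_generated_code(code: str) -> str:
    """Hypothetical sketch of the three fixes described in the abstract."""
    lines = code.split("\n")

    # Step 1: indentation correction -- if a flush-left `def` header is
    # followed by flush-left body lines, indent them one level.
    out, in_def = [], False
    for ln in lines:
        stripped = ln.strip()
        if stripped.startswith("def ") and ln == ln.lstrip():
            in_def = True
            out.append(ln)
        elif in_def and stripped and ln == ln.lstrip() \
                and not stripped.startswith(("import ", "from ")):
            out.append("    " + ln)
        else:
            out.append(ln)
    lines = out

    # Step 2: truncate redundant trailing output (e.g. prose or chat
    # markers appended after the function) by dropping lines from the
    # end until the snippet parses as valid Python.
    while lines:
        try:
            ast.parse("\n".join(lines))
            break
        except SyntaxError:
            lines.pop()
    code = "\n".join(lines)

    # Step 3: import missing modules -- prepend imports for known
    # stdlib modules that are referenced but never imported.
    tree = ast.parse(code)
    imported = {a.name.split(".")[0]
                for n in ast.walk(tree) if isinstance(n, ast.Import)
                for a in n.names}
    imported |= {n.module.split(".")[0]
                 for n in ast.walk(tree)
                 if isinstance(n, ast.ImportFrom) and n.module}
    used = {n.value.id for n in ast.walk(tree)
            if isinstance(n, ast.Attribute) and isinstance(n.value, ast.Name)}
    missing = sorted((used & STDLIB) - imported)
    header = "".join(f"import {m}\n" for m in missing)
    return header + code
```

For example, a generation with a flush-left body, a trailing prose sentence, and a missing `import math` is repaired into a runnable function by the three steps in sequence.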