Large language models (LLMs) have demonstrated an impressive ability to generate code for competitive programming tasks. However, when only a small number of samples can be drawn, their accuracy remains poor. Inspired by the process of human programming, we propose Self-Edit, a generate-and-edit approach that uses the execution results of LLM-generated code to improve code quality on competitive programming tasks. We execute the generated code on the example test case provided in the question and wrap the execution results into a supplementary comment. Guided by this comment, our fault-aware code editor corrects errors in the generated code. We perform extensive evaluations across two competitive programming datasets with nine different LLMs. Compared to direct generation from LLMs, our approach improves average pass@1 by 89\% on APPS-dev, 31\% on APPS-test, and 48\% on HumanEval across nine popular code generation LLMs with parameter sizes ranging from 110M to 175B. Compared to other post-processing methods, our method demonstrates superior accuracy and efficiency.
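The execute-and-comment step described above can be illustrated with a minimal sketch. It assumes the generated program is a Python script that reads from standard input. The helper names (`run_on_example`, `build_comment`) and the wording of the comment are hypothetical and are not taken from the paper. The fault-aware editor itself is not shown.

```python
import subprocess
import sys


def run_on_example(code, example_input, timeout=5.0):
    """Run `code` as a Python script with `example_input` on stdin.

    Returns the CompletedProcess, or None if the time limit is exceeded.
    """
    try:
        return subprocess.run(
            [sys.executable, "-c", code],
            input=example_input,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None


def build_comment(code, example_input, expected_output):
    """Wrap the execution result into a one-line supplementary comment.

    The comment wording is an illustrative assumption, not the paper's template.
    """
    proc = run_on_example(code, example_input)
    if proc is None:
        return "# Execution result: time limit exceeded on the example test."
    if proc.returncode != 0:
        # Keep only the last line of the traceback, e.g. the exception message.
        lines = proc.stderr.strip().splitlines()
        detail = lines[-1] if lines else "unknown error"
        return f"# Execution result: runtime error on the example test: {detail}"
    # Compare token by token so trailing whitespace does not matter.
    if proc.stdout.split() == expected_output.split():
        return "# Execution result: passed the example test."
    return (
        "# Execution result: wrong answer on the example test. "
        f"Expected {expected_output.strip()!r}, got {proc.stdout.strip()!r}."
    )
```

In this sketch, the returned comment would be attached to the generated program and passed, together with the problem description, to the editor model as its input.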