Large language models (LLMs) have demonstrated an impressive ability to generate code for competitive programming tasks. However, with a limited number of samples, LLMs still suffer from poor accuracy. Inspired by the process of human programming, we propose a generate-and-edit approach named Self-Edit that uses the execution results of LLM-generated code to improve code quality on competitive programming tasks. We execute the generated code on the example test case provided in the question and wrap the execution results into a supplementary comment. Using this comment as guidance, our fault-aware code editor corrects errors in the generated code. We perform extensive evaluations across two competitive programming datasets with nine different LLMs. Compared with generating directly from LLMs, our approach improves average pass@1 by 89% on APPS-dev, 31% on APPS-test, and 48% on HumanEval across nine popular code generation LLMs with parameter sizes ranging from 110M to 175B. Compared with other post-processing methods, our method demonstrates superior accuracy and efficiency.
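The step of executing generated code on the example test case and wrapping the result into a supplementary comment can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `run_on_example` and `build_comment`, the comment wording, and the timeout value are all hypothetical, and the sketch assumes the generated program is a Python script that reads from standard input and writes to standard output.

```python
import os
import subprocess
import sys
import tempfile


def run_on_example(code, example_input, timeout=5.0):
    """Run `code` as a Python script with `example_input` on stdin.

    Returns (returncode, stdout, stderr); returncode is None on timeout.
    Hypothetical helper, not taken from the paper.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            input=example_input,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        return proc.returncode, proc.stdout, proc.stderr
    except subprocess.TimeoutExpired:
        return None, "", "timed out"
    finally:
        os.unlink(path)


def build_comment(code, example_input, expected_output):
    """Summarize the execution result as a one-line comment.

    The wording of each message is an assumption made for illustration.
    """
    returncode, stdout, stderr = run_on_example(code, example_input)
    if returncode is None:
        return "# Time limit exceeded on the example test case."
    if returncode != 0:
        lines = stderr.strip().splitlines()
        last_line = lines[-1] if lines else "unknown error"
        return f"# Runtime error on the example test case: {last_line}"
    # Compare token by token so trailing whitespace does not matter.
    if stdout.split() == expected_output.split():
        return "# Passed the example test case."
    return (
        "# Wrong answer on the example test case. "
        f"Input: {example_input!r}; expected: {expected_output!r}; got: {stdout!r}"
    )
```

In a pipeline of this kind, the returned comment would be attached to the generated program and passed to the editor model as guidance for the correction step.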