Large language models (LLMs) have demonstrated an impressive ability to generate code for competitive programming tasks. However, when only a limited number of samples can be drawn, their accuracy remains poor. Inspired by the process of human programming, we propose a generate-and-edit approach named Self-Edit that uses the execution results of LLM-generated code to improve code quality on competitive programming tasks. We execute the generated code on the example test case provided in the question and wrap the execution results into a supplementary comment. Using this comment as guidance, our fault-aware code editor corrects errors in the generated code. We perform extensive evaluations across two competitive programming datasets with nine different LLMs. Compared to direct generation from LLMs, our approach improves average pass@1 by 89\% on APPS-dev, 31\% on APPS-test, and 48\% on HumanEval over nine popular code generation LLMs with parameter sizes ranging from 110M to 175B. Compared to other post-processing methods, our method demonstrates superior accuracy and efficiency.
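To make the execute-then-comment step concrete, the following is a minimal Python sketch of how a generated program might be run on the example test case and its outcome wrapped into a supplementary comment. It is an illustration under stated assumptions, not the implementation used in this work: the function names `run_on_example` and `build_comment`, the wording of the comment, and the 5-second timeout are all hypothetical, and the generated code is assumed to be a Python program that reads from standard input.

```python
import subprocess
import sys


def run_on_example(code: str, example_input: str, timeout: float = 5.0):
    """Run `code` as a standalone Python program with `example_input` on stdin.

    Returns a (status, stdout, stderr) tuple, where status is one of
    "ok", "error", or "timeout". (Hypothetical helper for illustration.)
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            input=example_input,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "timeout", "", ""
    if proc.returncode != 0:
        return "error", proc.stdout, proc.stderr
    return "ok", proc.stdout, proc.stderr


def build_comment(code: str, example_input: str, expected_output: str) -> str:
    """Summarize the execution result as a short natural-language comment.

    The comment wording here is an assumption chosen for illustration.
    """
    status, stdout, stderr = run_on_example(code, example_input)
    if status == "timeout":
        return "Time limit exceeded on the example test case."
    if status == "error":
        lines = stderr.strip().splitlines()
        last_line = lines[-1] if lines else "unknown error"
        return f"Runtime error on the example test case: {last_line}"
    # Compare token by token so trailing whitespace does not matter.
    if stdout.split() == expected_output.split():
        return "Passed the example test case."
    return (
        "Wrong answer on the example test case. "
        f"Expected: {expected_output.strip()!r}, got: {stdout.strip()!r}"
    )
```

In such a setup, the returned string would be attached to the generated program and passed, together with the problem description, to the editor model as guidance for the revision.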