Large language models (LLMs) have demonstrated an impressive ability to generate code for competitive programming tasks. However, with a limited number of samples, LLMs still suffer from poor accuracy. Inspired by the process of human programming, we propose a generate-and-edit approach named Self-Edit that uses the execution results of LLM-generated code to improve code quality on competitive programming tasks. We execute the generated code on the example test case provided in the question and wrap the execution results into a supplementary comment. Using this comment as guidance, our fault-aware code editor corrects errors in the generated code. We perform extensive evaluations across two competitive programming datasets with nine different LLMs. Compared to generating directly from LLMs, our approach improves average pass@1 by 89\% on APPS-dev, 31\% on APPS-test, and 48\% on HumanEval over nine popular code generation LLMs with parameter sizes ranging from 110M to 175B. Compared to other post-processing methods, our method demonstrates superior accuracy and efficiency.
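The execute-and-wrap step described in the abstract can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the function names `run_on_example` and `build_supplementary_comment`, the three status labels, the 5-second timeout, and the wording of the feedback comment are all hypothetical, and the candidate program is assumed to be a Python script that reads from standard input. The fault-aware code editor itself is not shown.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_on_example(code: str, example_input: str, timeout_s: float = 5.0):
    """Run `code` as a script, feeding `example_input` on stdin.

    Returns (status, detail), where status is "ok", "runtime_error",
    or "timeout". These labels are assumptions made for this sketch.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                input=example_input,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return "timeout", ""
    if proc.returncode != 0:
        # Keep only the last traceback line, e.g. "NameError: ...".
        lines = proc.stderr.strip().splitlines()
        return "runtime_error", lines[-1] if lines else ""
    return "ok", proc.stdout


def build_supplementary_comment(code: str, example_input: str,
                                expected_output: str) -> str:
    """Wrap the execution result into a one-line comment.

    The comment wording is hypothetical, not the paper's template.
    """
    status, detail = run_on_example(code, example_input)
    if status == "timeout":
        return "# Feedback: time limit exceeded on the example test case."
    if status == "runtime_error":
        return f"# Feedback: runtime error on the example test case: {detail}"
    # Compare token by token so trailing whitespace does not matter.
    if detail.split() == expected_output.split():
        return "# Feedback: passed the example test case."
    return ("# Feedback: wrong answer on the example test case. "
            f"Expected {expected_output.strip()!r}, got {detail.strip()!r}.")
```

In a pipeline of this shape, the returned comment would be attached to the generated program and the pair passed to the editor model as input. Running the candidate in a separate process keeps crashes and infinite loops in the generated code from taking down the caller, though it is not a security sandbox.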