Large language models (LLMs) have demonstrated an impressive ability to generate code for competitive programming tasks. However, when only a limited number of samples can be drawn, LLMs still suffer from poor accuracy. Inspired by the process of human programming, we propose a generate-and-edit approach that uses the execution results of LLM-generated code to improve code quality on competitive programming tasks. We execute the generated code on the example test case provided in the question and wrap the execution results into a supplementary comment. With this comment as guidance, our fault-aware code editor corrects errors in the generated code. We perform extensive evaluations across two competitive programming datasets with nine different LLMs. Compared to generating directly from LLMs, our approach improves average pass@1 by 89\% on APPS-dev, 31\% on APPS-test, and 48\% on HumanEval over nine popular code generation LLMs with parameter sizes ranging from 110M to 175B. Compared to other post-processing methods, our method demonstrates superior accuracy and efficiency.
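The execute-and-comment step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `run_on_example` and `build_comment`, the five-second timeout, the whitespace-insensitive output comparison, and the exact wording of the comments are all hypothetical, and the candidate program is assumed to be Python that reads from standard input.

```python
import subprocess
import sys


def run_on_example(code: str, example_input: str, timeout_s: float = 5.0):
    """Run candidate code in a fresh interpreter, feeding the example input on stdin.

    Returns (stdout, None) on success or (None, error_summary) on failure.
    The timeout value is an assumption made for this sketch.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            input=example_input,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None, "timed out"
    if proc.returncode != 0:
        # Keep only the last stderr line, which usually names the exception.
        lines = proc.stderr.strip().splitlines()
        return None, lines[-1] if lines else "non-zero exit status"
    return proc.stdout, None


def build_comment(code: str, example_input: str, expected_output: str) -> str:
    """Wrap the execution result into a one-line comment for a downstream editor.

    The comment wording is hypothetical; the source does not give the exact format.
    """
    stdout, error = run_on_example(code, example_input)
    if error is not None:
        return f"# Execution failed on the example test case: {error}"
    # Compare token by token so trailing whitespace does not cause a mismatch.
    if stdout.split() == expected_output.split():
        return "# Passed the example test case."
    return (
        "# Wrong answer on the example test case. "
        f"Input: {example_input!r}; expected: {expected_output!r}; got: {stdout!r}"
    )
```

In a pipeline of this kind, the returned comment would be placed alongside the problem description and the generated program, and the combined text would be passed to the editor model as its input.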