Pre-trained code language models (code LMs) have achieved promising performance in code generation and improved the programming efficiency of human developers. However, existing evaluations of code LMs focus only on the accuracy of one-time prediction and typically overlook their self-refinement capability. When a code LM fails to produce a correct program, developers find it hard to debug and fix the faulty prediction, since they did not write it themselves. Unfortunately, our study reveals that code LMs cannot efficiently self-refine their faulty generations either. In this paper, we propose CYCLE, a framework that learns to self-refine faulty generations according to available feedback, such as the execution results reported by test suites. We evaluate CYCLE on three popular code generation benchmarks: HumanEval, MBPP, and APPS. The results show that CYCLE maintains, and sometimes improves, the quality of one-time code generation while significantly improving the self-refinement capability of code LMs. We implement four variants of CYCLE with 350M, 1B, 2B, and 3B parameters, and the experiments show that CYCLE consistently boosts code generation performance, by up to 63.5%, across benchmarks and model sizes. We also observe that CYCLE outperforms code LMs with 3$\times$ more parameters in self-refinement.
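The idea of refining a faulty generation from test-suite feedback can be pictured as a simple generate, execute, refine loop. The Python sketch below is only an illustration under stated assumptions, not CYCLE's training or inference procedure, which the abstract does not specify. The names `run_tests`, `self_refine`, and `toy_model` are hypothetical, and `toy_model` is a hard-coded stand-in for a code LM.

```python
import traceback
from typing import Callable, List, Optional, Tuple

# Hypothetical signature for a code LM: (prompt, previous_code, feedback) -> new code.
Generator = Callable[[str, Optional[str], Optional[str]], str]


def run_tests(code: str, tests: List[str]) -> Tuple[bool, str]:
    """Execute the candidate code, then each test statement.

    Returns (passed, feedback). Feedback is the last line of the traceback
    on failure. A real system should run untrusted code in a sandbox.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)
        for test in tests:
            exec(test, namespace)
    except Exception:
        return False, traceback.format_exc().strip().splitlines()[-1]
    return True, "all tests passed"


def self_refine(
    generate: Generator, prompt: str, tests: List[str], max_rounds: int = 3
) -> Tuple[str, bool, int]:
    """Regenerate code until the tests pass or max_rounds is reached.

    Returns (final_code, passed, rounds_used).
    """
    code: Optional[str] = None
    feedback: Optional[str] = None
    passed = False
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        # Each round sees the previous attempt and its execution feedback.
        code = generate(prompt, code, feedback)
        passed, feedback = run_tests(code, tests)
        if passed:
            break
    return code or "", passed, rounds


def toy_model(prompt: str, previous_code: Optional[str], feedback: Optional[str]) -> str:
    """Stand-in for a code LM (assumption): wrong first, correct after feedback."""
    if previous_code is None:
        return "def add(a, b):\n    return a - b\n"  # faulty first attempt
    return "def add(a, b):\n    return a + b\n"  # refined attempt


if __name__ == "__main__":
    final_code, ok, used = self_refine(
        toy_model, "Write add(a, b).", ["assert add(2, 3) == 5"]
    )
    print(ok, used)
```

In this toy run the first attempt fails its assertion and the second attempt passes. One-time accuracy corresponds to the result after the first round, while self-refinement capability concerns how often later rounds turn a failure into a pass.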