The strong performance of large language models (LLMs) on natural language processing tasks has sparked extensive discussion about their application to code generation. Recent work proposes multiple-sampling approaches to improve initial code generation accuracy, or program repair approaches to refine the code. However, these methods suffer from LLMs' inefficiencies and limited reasoning capabilities. In this work, we propose an LLM programming workflow (LPW) designed to improve both initial code generation and subsequent refinements within a structured two-phase workflow. Specifically, in the solution generation phase, the LLM first outlines a solution plan that decomposes the problem into manageable sub-problems, and then verifies the generated plan against the visible test cases. Subsequently, in the code implementation phase, the LLM drafts an initial program according to the solution plan and its verification. If the generated program fails the visible tests, the plan verification serves as the intended natural-language solution to inform the refinement process and correct bugs. We further introduce SLPW, a sampling variant of LPW, which initially generates multiple solution plans and plan verifications, produces a program for each plan and its verification, and refines each program as necessary until one passes the visible tests. Experiments across various existing LLMs show that LPW significantly improves Pass@1 accuracy by up to 16.4% over state-of-the-art methods on well-established text-to-code generation benchmarks, with a notable improvement of around 10% on challenging benchmarks. Additionally, SLPW demonstrates up to a 5.6% improvement over LPW and sets new state-of-the-art Pass@1 accuracy on various benchmarks, e.g., 98.2% on HumanEval, 84.8% on MBPP, 64.0% on APPS, and 35.3% on CodeContest, using GPT-4o as the backbone.
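The two-phase workflow described above can be sketched as a simple control loop. This is a minimal illustration only: every function name below (`generate_plan`, `verify_plan`, `draft_code`, `refine_code`) is a hypothetical stand-in for an LLM call, stubbed here so the control flow runs; the paper's actual prompts and model interactions are not reproduced.

```python
# Minimal sketch of the LPW control flow. All generate_*/verify_*/refine_*
# functions are hypothetical stand-ins for LLM calls, stubbed for illustration.

def generate_plan(problem: str) -> str:
    # Phase 1: decompose the problem into manageable sub-problems (stubbed).
    return f"plan for: {problem}"

def verify_plan(plan: str, visible_tests: list[str]) -> str:
    # Walk the plan through each visible test case to check it (stubbed).
    return "verification of " + plan

def draft_code(plan: str, verification: str) -> str:
    # Phase 2: draft a program from the plan and its verification (stubbed).
    return "def solution(x):\n    return x"

def refine_code(code: str, verification: str, error: str) -> str:
    # The plan verification acts as the intended natural-language solution
    # that guides bug fixing (stubbed: returns the code unchanged).
    return code

def run_visible_tests(code: str, visible_tests: list[str]):
    # Returns (passed, first_error); stubbed to always pass.
    return True, None

def lpw(problem: str, visible_tests: list[str], max_refinements: int = 3) -> str:
    """Solution generation phase, then code implementation with refinement."""
    plan = generate_plan(problem)
    verification = verify_plan(plan, visible_tests)
    code = draft_code(plan, verification)
    for _ in range(max_refinements):
        passed, error = run_visible_tests(code, visible_tests)
        if passed:
            return code
        code = refine_code(code, verification, error)
    return code
```

The sampling variant SLPW would wrap this loop: sample several plan/verification pairs, run the implementation-and-refinement loop for each, and stop as soon as one candidate passes the visible tests.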