SelfEvolve: A Code Evolution Framework via Large Language Models

Large language models (LLMs) pretrained on publicly available code data have already revolutionized code generation. Various methods have been proposed to augment LLMs with retrieved knowledge and thereby improve the quality of generated code, but the performance of these retrieval-based methods is limited by the strength of the retrievers they use. In addition, although LLMs show strong emergent abilities, they still struggle to produce correct code in a single turn. To address these challenges, we propose a novel two-step pipeline, called \autoknow, that leverages LLMs as both knowledge providers and self-reflective programmers. Unlike retrieval-based methods, \autoknow~prompts the LLM itself to produce the knowledge relevant to the input prompt and then generates intermediate code conditioned on that knowledge. After that, \autoknow~asks the LLM to act as an expert programmer and debug the generated code. Debugging relies only on the error message returned by the interpreter and does not require special test cases for correctness verification. We evaluate \autoknow~on three code generation datasets: DS-1000 for data science code, HumanEval for software engineering code, and TransCoder for C++-to-Python translation. Our experiments show that \autoknow~outperforms strong baselines by a significant margin on all three datasets. We also conduct extensive analytical experiments to validate the effectiveness of the two stages of \autoknow, and find that both are superior to other prompting-based methods. Further scalability analysis demonstrates that \autoknow~can be adapted to more advanced models, such as GPT-4, and brings consistent improvements.
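The two-step pipeline described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `llm` callable, the prompt wording, the function names, and the `max_rounds` limit are all hypothetical, chosen only to show the control flow (generate knowledge, generate code from it, then revise the code using interpreter error messages instead of test cases).

```python
import traceback
from typing import Callable, Optional

# Hypothetical interface: any function that maps a prompt to a completion.
LLM = Callable[[str], str]


def run_and_capture_error(code: str) -> Optional[str]:
    """Run code in a fresh namespace; return the traceback text, or None on success.

    exec() on model output is unsafe outside a sandbox; it is used here
    only to keep the sketch short.
    """
    try:
        exec(code, {"__name__": "__sketch__"})
        return None
    except Exception:
        return traceback.format_exc()


def two_step_pipeline(task: str, llm: LLM, max_rounds: int = 3) -> str:
    """Assumed control flow: knowledge -> intermediate code -> debugging loop."""
    # Step 1a: the model acts as its own knowledge provider (no retriever).
    knowledge = llm(f"List the APIs and facts needed to solve this task:\n{task}")

    # Step 1b: intermediate code is generated from the task plus that knowledge.
    code = llm(f"Task:\n{task}\n\nKnowledge:\n{knowledge}\n\nWrite Python code.")

    # Step 2: the model debugs its own code, guided only by interpreter errors.
    for _ in range(max_rounds):
        error = run_and_capture_error(code)
        if error is None:
            break  # the code ran without raising; stop revising
        code = llm(
            "You are an expert programmer. Fix this code.\n\n"
            f"Code:\n{code}\n\nError:\n{error}"
        )
    return code
```

Running without an exception is a weaker signal than passing tests, so this loop only catches failures the interpreter reports; semantically wrong code that executes cleanly passes through unchanged.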
