Large language models (LLMs) have achieved impressive performance on code generation. However, for complex programming tasks, generating the correct solution in a single attempt is challenging, so prior work has designed program repair approaches to improve code generation performance. In this work, we propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations. In particular, we demonstrate that Self-Debugging can teach the large language model to perform rubber duck debugging; i.e., without any human feedback on code correctness or error messages, the model is able to identify its mistakes by investigating the execution results and explaining the generated code in natural language. Self-Debugging achieves state-of-the-art performance on several code generation benchmarks, including the Spider dataset for text-to-SQL generation, TransCoder for C++-to-Python translation, and MBPP for text-to-Python generation. On the Spider benchmark, where there are no unit tests to verify the correctness of predictions, Self-Debugging with code explanation consistently improves the baseline by 2-3% and improves prediction accuracy on problems of the hardest level by 9%. On TransCoder and MBPP, where unit tests are available, Self-Debugging improves the baseline accuracy by up to 12%. Meanwhile, by leveraging feedback messages and reusing failed predictions, Self-Debugging notably improves sample efficiency and can match or outperform baseline models that generate more than 10x as many candidate programs.
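To make the debugging loop concrete, the following is a minimal Python sketch of the unit-test setting described in the abstract: generate a program, execute it against the tests, and feed the failure message together with the failed prediction back to the model for another attempt. It is an illustration under stated assumptions, not the authors' implementation. The names `run_unit_tests`, `self_debug`, `toy_model`, and the `max_turns` parameter are hypothetical, and `toy_model` is a hard-coded stand-in for a few-shot prompted LLM. The code-explanation (rubber duck) variant used when no unit tests exist is not modeled here.

```python
from typing import Callable, List, Optional, Tuple

# A test case is (positional arguments, expected return value).
TestCase = Tuple[tuple, object]
# Hypothetical model interface: (task, previous program, feedback) -> new program.
Model = Callable[[str, Optional[str], Optional[str]], str]


def run_unit_tests(program: str, tests: List[TestCase], func_name: str) -> str:
    """Execute `program` and return a feedback message; empty string if all tests pass."""
    namespace: dict = {}
    try:
        exec(program, namespace)  # illustration only: never exec untrusted code unsandboxed
        func = namespace[func_name]
        for args, expected in tests:
            actual = func(*args)
            if actual != expected:
                return f"{func_name}{args!r} returned {actual!r}, expected {expected!r}"
    except Exception as exc:  # runtime and syntax errors also become feedback
        return f"{type(exc).__name__}: {exc}"
    return ""


def self_debug(task: str, model: Model, tests: List[TestCase],
               func_name: str, max_turns: int = 3) -> Tuple[str, bool, int]:
    """Generate a program, then revise it using execution feedback for up to `max_turns`."""
    program = model(task, None, None)  # initial prediction, no feedback yet
    feedback = run_unit_tests(program, tests, func_name)
    turns = 0
    while feedback and turns < max_turns:
        # Reuse the failed prediction and its feedback instead of sampling from scratch.
        program = model(task, program, feedback)
        feedback = run_unit_tests(program, tests, func_name)
        turns += 1
    return program, feedback == "", turns


def toy_model(task: str, previous: Optional[str], feedback: Optional[str]) -> str:
    """Stand-in for an LLM: the first attempt is buggy, any revision is correct."""
    if previous is None:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


if __name__ == "__main__":
    tests = [((1, 2), 3), ((0, 0), 0)]
    program, passed, turns = self_debug("Add two numbers.", toy_model, tests, "add")
    print(passed, turns)
    print(program)
```

With the toy model, the first prediction fails the first test, the feedback message triggers one revision, and the revised program passes. A real system would replace `toy_model` with a call to an LLM whose prompt contains the task, the earlier program, and the feedback message.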