Large Language Models (LLMs) show promise in code generation tasks. However, their code-writing abilities are often limited in scope: while they can successfully implement simple functions, they struggle with more complex tasks. A fundamental difference between how an LLM writes code and how a human programmer does is that the LLM cannot consistently spot and fix bugs. Debugging is a crucial skill for programmers, and it enables iterative refinement of code toward a correct implementation. In this work, we propose a novel algorithm that enables LLMs to debug their code via self-reflection and search, where a model attempts to identify its previous mistakes. Our key contributions are: 1) a best-first tree search algorithm with self-reflections (BESTER) that achieves state-of-the-art Pass@1 on three code generation benchmarks. BESTER maintains its superiority when pass rates are measured with the additional inference costs incurred by tree search taken into account. 2) A novel interpretability study of what self-reflections attend to in buggy programs and how they impact bug fixes, which provides a deeper understanding of the debugging process. 3) An extensive study of when self-reflections are effective in finding bugs.
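The combination of best-first search and self-reflection described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helpers `generate`, `reflect`, and `run_tests` are hypothetical stand-ins for an LLM's code-generation call, its self-critique of a buggy candidate, and a unit-test harness returning the fraction of tests passed.

```python
import heapq
import itertools

def best_first_debug(generate, reflect, run_tests, max_expansions=10):
    """Sketch of a best-first tree search with self-reflections.

    Candidate programs are scored by the fraction of unit tests they pass;
    the highest-scoring failing candidate is expanded next by asking the
    model to reflect on its bugs and propose a fixed child program.
    (Hypothetical interface; not the authors' actual API.)
    """
    tie = itertools.count()                    # heap tie-breaker
    program = generate(parent=None, reflection=None)
    score = run_tests(program)                 # fraction of tests passed
    if score == 1.0:
        return program
    frontier = [(-score, next(tie), program)]  # max-heap via negated score
    for _ in range(max_expansions):
        if not frontier:
            break
        neg_score, _, parent = heapq.heappop(frontier)
        reflection = reflect(parent)           # model critiques its own bug
        child = generate(parent=parent, reflection=reflection)
        child_score = run_tests(child)
        if child_score == 1.0:
            return child                       # all tests pass: done
        heapq.heappush(frontier, (-child_score, next(tie), child))
        heapq.heappush(frontier, (neg_score, next(tie), parent))  # allow revisits
    return None                                # budget exhausted
```

The key design choice is that failed candidates stay in the frontier, so the search can branch from whichever partial solution currently passes the most tests rather than committing to a single chain of fixes.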