While large language models (LLMs) have made significant strides in code generation, the pass rate of generated code remains bottlenecked by subtle errors that often require human intervention to pass tests, especially for complex problems. Existing LLM-based debugging systems treat generated programs as monolithic units and fail to address bugs at multiple levels of granularity, from low-level syntax errors to high-level algorithmic flaws. In this paper, we introduce Multi-Granularity Debugger (MGDebugger), a hierarchical code debugger that isolates, identifies, and resolves bugs at various levels of granularity. MGDebugger decomposes problematic code into a hierarchical tree of subfunctions, with each level representing a particular granularity of error. During debugging, it analyzes each subfunction and iteratively resolves bugs in a bottom-up manner. To test each subfunction effectively, we propose an LLM-simulated Python executor, which traces code execution and tracks important variable states to pinpoint errors accurately. Extensive experiments demonstrate that MGDebugger outperforms existing debugging systems, achieving an 18.9% accuracy improvement over seed generations on HumanEval and a 97.6% repair success rate on HumanEvalFix. Furthermore, MGDebugger effectively fixes bugs across different categories and difficulty levels, demonstrating its robustness and effectiveness.
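The bottom-up hierarchical debugging loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the node structure, test representation, and the `llm_fix` placeholder (which stands in for an actual LLM repair call) are all hypothetical.

```python
# Sketch of bottom-up hierarchical debugging over a subfunction tree.
# Hypothetical structures for illustration; `llm_fix` is a stand-in for
# an LLM-based repair step, not the paper's actual implementation.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SubfunctionNode:
    name: str
    code: str                                        # source of this subfunction
    tests: List[Callable[[str], bool]] = field(default_factory=list)
    children: List["SubfunctionNode"] = field(default_factory=list)


def passes_tests(node: SubfunctionNode) -> bool:
    """Run this subfunction's own tests against its current code."""
    return all(test(node.code) for test in node.tests)


def llm_fix(node: SubfunctionNode) -> None:
    """Placeholder for an LLM repair call; here it just patches a marker."""
    node.code = node.code.replace("BUG", "OK")


def debug_bottom_up(node: SubfunctionNode) -> bool:
    """Post-order traversal: repair children before their parent, so
    higher-level fixes build on already-correct subfunctions."""
    for child in node.children:
        debug_bottom_up(child)
    if not passes_tests(node):
        llm_fix(node)
    return passes_tests(node)
```

For example, a tree with a buggy leaf and a correct root is repaired leaf-first, so the root is re-checked only after its dependency is fixed.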