Large Language Models (LLMs) have shown remarkable potential in code generation tasks, and recent research in prompt engineering has enhanced LLMs' understanding of textual information. However, ensuring the accuracy of generated code often requires extensive testing and validation by programmers. While LLMs can typically generate code from a task description, their accuracy remains limited, especially on complex tasks that require a deeper understanding of both the problem statement and the code generation process. This limitation arises primarily because LLMs must simultaneously comprehend text and generate syntactically and semantically correct code, without any capability to refine that code automatically. In real-world software development, programmers rarely produce flawless code in a single attempt from the task description alone; instead, they rely on iterative feedback and debugging to refine their programs. Inspired by this process, we introduce a novel LLM-based agent architecture for code generation and automatic debugging: Refinement and Guidance Debugging (RGD). The RGD framework is a multi-agent debugger that coordinates three distinct LLM agents: a Guide Agent, a Debug Agent, and a Feedback Agent. RGD decomposes the code generation task into multiple steps, ensuring a clearer workflow and enabling iterative code refinement based on self-reflection and feedback. Experimental results demonstrate that RGD achieves state-of-the-art code generation performance, improving over existing approaches and traditional direct prompting by 9.8% on the HumanEval dataset and 16.2% on the MBPP dataset. These results highlight the effectiveness of the RGD framework in enhancing LLMs' ability to generate and refine code autonomously.
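The decompose-then-refine loop described above can be sketched in code. This is a minimal illustration under assumed interfaces: the function names (`guide_agent`, `debug_agent`, `feedback_agent`), the stopping criterion, and the stubbed agent bodies are hypothetical; in the actual framework each agent would be backed by LLM calls with its own prompts.

```python
# Hypothetical sketch of an RGD-style refinement loop.
# The three agents are stubbed with trivial logic for illustration only;
# the real framework drives each of them with an LLM.

def guide_agent(task: str) -> list[str]:
    # Decompose the task into smaller steps to guide generation (stubbed).
    return [f"step: {task}"]

def feedback_agent(code: str, tests: list) -> str:
    # Run the tests and summarize failures; an empty string means all passed.
    failures = [t for t in tests if not t(code)]
    return f"{len(failures)} test(s) failed" if failures else ""

def debug_agent(code: str, feedback: str) -> str:
    # Revise the candidate code in light of the feedback (stubbed: a toy
    # "repair" that rewrites an obvious marker in the code).
    return code.replace("bug", "fix") if feedback else code

def rgd(task: str, initial_code: str, tests: list, max_rounds: int = 3) -> str:
    plan = guide_agent(task)            # 1. decompose the task into steps
    code = initial_code
    for _ in range(max_rounds):         # 2. iterate: test, reflect, refine
        feedback = feedback_agent(code, tests)
        if not feedback:                # all tests pass: stop early
            break
        code = debug_agent(code, feedback)
    return code
```

A usage example: `rgd("demo task", "print('bug')", [lambda c: "bug" not in c])` repairs the failing candidate within one round. The point of the structure, as in the abstract, is that generation is not a single shot: the Feedback Agent supplies the test signal that the Debug Agent uses to refine the code iteratively.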