Pre-trained on massive amounts of code and text data, large language models (LLMs) have demonstrated remarkable achievements in performing code generation tasks. With additional execution-based feedback, these models can act as agents with capabilities to self-refine and improve generated code autonomously. However, on challenging coding tasks with extremely large search spaces, current agentic approaches still struggle with multi-stage planning, generating, and debugging. To address this problem, we propose CodeTree, a framework for LLM agents to efficiently explore the search space in different stages of the code generation process. Specifically, we adopt a unified tree structure to explicitly explore different coding strategies, generate corresponding coding solutions, and subsequently refine the solutions. In each stage, critical decision-making (ranking, terminating, expanding) of the exploration process is guided by both environmental execution-based feedback and LLM-agent-generated feedback. We comprehensively evaluate CodeTree on 7 code generation benchmarks and demonstrate significant performance gains of CodeTree over strong baselines. Using GPT-4o as the base model, we consistently achieve top results of 95.1 on HumanEval, 98.7 on MBPP, and 43.0 on CodeContests. On the challenging SWEBench benchmark, our approach also leads to significant performance gains.