Pre-trained on massive amounts of code and text data, large language models (LLMs) have demonstrated remarkable performance on code generation tasks. With additional execution-based feedback, these models can act as agents that autonomously self-refine and improve the code they generate. However, on challenging coding tasks with extremely large search spaces, current agentic approaches still struggle with multi-stage planning, generation, and debugging. To address this problem, we propose CodeTree, a framework for LLM agents to efficiently explore the search space across the different stages of the code generation process. Specifically, we adopt a unified tree structure to explicitly explore different coding strategies, generate the corresponding solutions, and subsequently refine them. In each stage, the critical decisions of the exploration process (ranking, termination, expansion) are guided by both execution-based environmental feedback and feedback generated by LLM agents. We comprehensively evaluate CodeTree on 7 code generation benchmarks and demonstrate significant performance gains over strong baselines. Using GPT-4o as the base model, we consistently achieve top results of 95.1 on HumanEval, 98.7 on MBPP, and 43.0 on CodeContests. On the challenging SWEBench benchmark, our approach also leads to significant performance gains.
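The feedback-guided exploration described above can be sketched as a best-first tree search: candidate solutions are ranked by a feedback score, the most promising node is expanded into refinements, and search terminates once the feedback signals success or the budget is exhausted. This is a minimal illustrative sketch, not the paper's implementation: the names (`Node`, `explore`, `score`, `refine`) are hypothetical, and the scoring function stands in for the combination of execution-based and agent-generated feedback.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Node:
    # heapq is a min-heap, so we store the negated score to pop the best node first
    neg_score: float
    solution: object = field(compare=False)
    depth: int = field(compare=False, default=0)

def explore(root_solution, score, refine, max_expansions=10, target=1.0):
    """Best-first search over a solution tree.

    score(solution) -> float in [0, 1]: stand-in for execution/agent feedback.
    refine(solution) -> list of candidate refinements (child nodes).
    """
    frontier = [Node(-score(root_solution), root_solution)]
    best = frontier[0]
    for _ in range(max_expansions):
        if not frontier:
            break
        node = heapq.heappop(frontier)            # ranking: pick the highest-scored node
        if -node.neg_score >= target:             # termination: feedback says it is solved
            return node.solution, -node.neg_score
        if -node.neg_score > -best.neg_score:
            best = node
        for child in refine(node.solution):       # expansion: propose refined candidates
            heapq.heappush(frontier, Node(-score(child), child, node.depth + 1))
    return best.solution, -best.neg_score
```

As a toy usage, searching for the integer 3 starting from 0, with `refine` proposing neighboring values and `score` measuring closeness, converges in a few expansions:

```python
score = lambda x: 1 - abs(x - 3) / 10
refine = lambda x: [x - 1, x + 1]
solution, s = explore(0, score, refine, max_expansions=20)  # -> (3, 1.0)
```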