Transformer-based large language models (LLMs) have significantly advanced natural language processing (NLP). These models have transformed NLP tasks, particularly code generation, helping developers write software more efficiently. Despite this progress, balancing code snippet generation with effective test case generation and execution remains a challenge. To address this, this paper introduces Multi-Agent Assistant Code Generation (AgentCoder), a novel multi-agent framework with three specialized agents: the programmer agent, the test designer agent, and the test executor agent. During coding, the programmer agent generates code and refines it based on feedback from the test executor agent. The test designer agent generates test cases for the generated code, and the test executor agent runs the code against those test cases and returns feedback to the programmer agent. This collaborative system yields robust code generation, overcoming the limitations of single-agent models and traditional methodologies. Our extensive experiments on 9 code generation models and 12 enhancement approaches show that AgentCoder outperforms existing code generation models and prompt engineering techniques across various benchmarks. For example, AgentCoder (GPT-4) achieves 96.3% and 91.8% pass@1 on the HumanEval and MBPP datasets with an overall token overhead of 56.9K and 66.3K, respectively, while the state of the art obtains only 90.2% and 78.9% pass@1 with an overall token overhead of 138.2K and 206.5K.
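The interaction among the three agents can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`execute_tests`, `agent_loop`), the callable signatures, the `max_rounds` limit, and the stub agents standing in for LLM calls are all assumptions made for illustration. The sketch also assumes the test designer works from the task description alone and that tests are plain `assert` statements.

```python
import traceback
from typing import Callable, Optional, Tuple

# Hypothetical agent interfaces (assumptions, not taken from the paper):
# a programmer maps (task, feedback) to source code,
# a test designer maps a task description to assert-based test source.
Programmer = Callable[[str, Optional[str]], str]
TestDesigner = Callable[[str], str]


def execute_tests(code: str, tests: str) -> Optional[str]:
    """Test executor: run the code, then the tests, in one namespace.

    Returns None when every assertion passes, otherwise the traceback text.
    NOTE: exec() on untrusted model output is unsafe; a real system would
    need a sandboxed runner.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)   # define the candidate solution
        exec(tests, namespace)  # run the assertions against it
    except Exception:
        return traceback.format_exc()
    return None


def agent_loop(
    task: str,
    programmer: Programmer,
    test_designer: TestDesigner,
    max_rounds: int = 3,  # assumed iteration budget
) -> Tuple[str, int, bool]:
    """Generate code, test it, and feed failures back to the programmer."""
    tests = test_designer(task)
    feedback: Optional[str] = None
    code = ""
    for round_no in range(1, max_rounds + 1):
        code = programmer(task, feedback)
        feedback = execute_tests(code, tests)
        if feedback is None:
            return code, round_no, True
    return code, max_rounds, False


# Stub agents standing in for LLM calls (purely illustrative).
def stub_programmer(task: str, feedback: Optional[str]) -> str:
    # The first attempt is deliberately wrong; the fix appears once feedback arrives.
    if feedback is None:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


def stub_test_designer(task: str) -> str:
    return "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"


code, rounds, passed = agent_loop(
    "Write add(a, b) that returns the sum.", stub_programmer, stub_test_designer
)
print(f"passed={passed} after {rounds} round(s)")
```

With these stubs, the first candidate fails the designer's assertions, the executor returns the traceback as feedback, and the second candidate passes. Keeping test design separate from code generation means the tests are not shaped by the candidate under test.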