Large Language Models (LLMs) and Generative Pre-trained Transformers (GPTs) are reshaping the field of Software Engineering (SE). Existing LLM-based multi-agent systems have successfully resolved simple dialogue tasks. However, the potential of LLMs for more complex tasks, such as automated code generation for large and complex projects, has been explored in only a few existing works. This paper introduces CodePori, a novel model designed to automate code generation for extensive and complex software projects based on natural language prompts. We employ multiple LLM-based AI agents to handle creative and challenging tasks in autonomous software development. Each agent handles a specific task: system design, code development, code review, code verification, or test engineering. We show that CodePori can generate running code for large-scale projects, completing the entire software development process in minutes rather than hours, at a cost of a few dollars. It identifies and mitigates potential security vulnerabilities and corrects errors while maintaining a solid level of code performance. We also evaluated CodePori against existing solutions on the HumanEval and Mostly Basic Python Problems (MBPP) benchmarks. The results indicate that CodePori outperforms existing solutions on these benchmarks in terms of code accuracy, efficiency, and overall performance. For example, CodePori achieves a pass@1 score of 87.5% on HumanEval and 86.5% on MBPP, a clear improvement over existing models. We also assessed CodePori's performance through practitioner evaluations, in which 91% of participants expressed satisfaction with the model's performance.
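The abstract reports pass@1 scores without stating how they are computed. As a minimal sketch, assuming the standard unbiased pass@k estimator introduced with HumanEval (not confirmed here as the exact procedure used for CodePori), the metric could be computed as follows. The function names `pass_at_k` and `benchmark_pass_at_k` are hypothetical, chosen for illustration only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of code samples generated for the problem
    c: number of those samples that pass all unit tests
    k: number of samples the metric is allowed to draw

    Assumption: this is the standard estimator 1 - C(n-c, k) / C(n, k),
    i.e. one minus the probability that k samples drawn without
    replacement are all failures.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over a benchmark.

    results: one (n, c) pair per problem (hypothetical input format).
    """
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimator reduces to c / n per problem, so a benchmark-level pass@1 is the average fraction of first attempts that pass the tests. For example, `benchmark_pass_at_k([(1, 1), (1, 0), (1, 1), (1, 1)], k=1)` averages three passes over four problems.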