Software process models are essential for collaboration and communication within software teams, enabling them to tackle complex development tasks effectively. This paper introduces LCG, a code generation framework inspired by established software engineering practices. LCG uses multiple Large Language Model (LLM) agents to emulate different software process models, yielding three variants: LCGWaterfall, LCGTDD, and LCGScrum. Each variant assigns LLM agents specific roles, such as requirement engineer, architect, developer, tester, and scrum master, that mirror typical development activities and communication patterns. Collaborating through chain-of-thought and prompt composition techniques, the agents continuously refine themselves to improve code quality. Using GPT3.5 as both the underlying LLM and the baseline (GPT), we evaluate LCG on four code generation benchmarks: HumanEval, HumanEval-ET, MBPP, and MBPP-ET. LCGScrum outperforms the other variants, achieving Pass@1 scores of 75.2, 65.5, 82.5, and 56.7 on HumanEval, HumanEval-ET, MBPP, and MBPP-ET, respectively, an average improvement of 15% over GPT. Our analysis shows that development activities affect the generated code in distinct ways: design and code reviews improve exception handling, while design, testing, and code reviews reduce code smells. Temperature values have a negligible influence on Pass@1 across all models. In contrast, Pass@1 varies notably across GPT3.5 model versions, ranging from 5 to over 60 on HumanEval, while LCG remains stable across those versions. This stability underscores the value of adopting software process models to improve the quality and consistency of LLM-generated code.
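The role-based collaboration described above can be illustrated with a minimal sketch of a waterfall-style chain, in which each agent's prompt is composed from its role, the task, and the artifacts produced by earlier roles. This is not the LCG implementation: the `Agent` class, `run_waterfall`, the instruction strings, and the stub `echo_llm` are all hypothetical names introduced here, and a real setup would replace the stub with a call to an actual model.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


@dataclass
class Agent:
    role: str         # e.g. "requirement engineer"
    instruction: str  # what this role is asked to produce (illustrative wording)

    def act(self, llm: LLM, task: str, artifacts: List[str]) -> str:
        # Prompt composition: role, task, and everything earlier roles produced.
        context = "\n\n".join(artifacts)
        prompt = (
            f"You are the {self.role}.\n"
            f"Task: {task}\n"
            f"Previous artifacts:\n{context}\n"
            f"{self.instruction}"
        )
        return llm(prompt)


def run_waterfall(llm: LLM, task: str, agents: List[Agent]) -> List[str]:
    """Run the agents once, in order, passing artifacts down the chain."""
    artifacts: List[str] = []
    for agent in agents:
        output = agent.act(llm, task, artifacts)
        artifacts.append(f"[{agent.role}]\n{output}")
    return artifacts


# Assumed role order for a waterfall-like process; the wording is illustrative.
WATERFALL = [
    Agent("requirement engineer", "Write the requirements."),
    Agent("architect", "Propose a design."),
    Agent("developer", "Write the code."),
    Agent("tester", "Write tests and report failures."),
]


def echo_llm(prompt: str) -> str:
    # Stand-in for a real model call: returns the first line of the prompt.
    return prompt.splitlines()[0]


if __name__ == "__main__":
    for artifact in run_waterfall(echo_llm, "add two numbers", WATERFALL):
        print(artifact)
```

An iterative process such as Scrum or TDD would, under the same assumptions, wrap this chain in a loop that feeds tester or reviewer output back to the developer role.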