When LLM-based Code Generation Meets the Software Development Process

Software process models play a pivotal role in fostering collaboration and communication within software teams, enabling them to tackle intricate development tasks effectively. This paper introduces LCG, a code generation framework inspired by established software engineering practices. LCG uses multiple Large Language Model (LLM) agents to emulate three software process models, yielding the variants LCGWaterfall, LCGTDD, and LCGScrum. Each variant assigns LLM agents specific roles such as requirement engineer, architect, developer, tester, and scrum master, mirroring typical development activities and communication patterns. Collaborating through chain-of-thought and prompt composition techniques, the agents continuously refine themselves to enhance code quality. Using GPT3.5 as both the underlying LLM and the baseline (GPT), we evaluate LCG on four code generation benchmarks: HumanEval, HumanEval-ET, MBPP, and MBPP-ET. Results indicate that LCGScrum outperforms the other models, achieving Pass@1 scores of 75.2, 65.5, 82.5, and 56.7 on HumanEval, HumanEval-ET, MBPP, and MBPP-ET, respectively, an average 15% improvement over GPT. Analysis reveals that development activities affect the generated code in distinct ways: design and code reviews contribute to better exception handling, while design, testing, and code reviews mitigate code smells. Furthermore, temperature values have negligible influence on Pass@1 across all models. However, Pass@1 varies notably across GPT3.5 model versions, ranging from 5 to over 60 on HumanEval, whereas LCG remains stable across model versions. This stability underscores the importance of adopting software process models to bolster the quality and consistency of LLM-generated code.
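The abstract reports Pass@1 without saying how it is computed. As a minimal sketch, assuming the paper follows the unbiased pass@k estimator commonly used with HumanEval (an assumption, not something stated above), the metric can be written as follows. The function names `pass_at_k` and `benchmark_pass_at_1` are illustrative and do not come from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of generated samples, c: samples that pass all tests,
    k: number of samples the user is allowed to draw.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: list[bool]) -> float:
    """Pass@1 as a percentage, assuming one generated solution per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return 100.0 * sum(results) / len(results)
```

With a single sample per problem, Pass@1 reduces to the fraction of problems whose generated solution passes every test, which is how a score such as 75.2 on HumanEval would be read under this assumption.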
