Automatically generating source code from natural language using large language models (LLMs) is becoming common, yet security vulnerabilities persist despite advances in fine-tuning and prompting. In this work, we systematically evaluate whether multi-LLM ensembles and collaborative strategies can meaningfully improve secure code generation. We present MULTI-LLMSECCODEEVAL, a framework for assessing and enhancing security across the vulnerability management lifecycle by combining multiple LLMs with static analysis and structured collaboration. Using SecLLMEval and SecLLMHolmes, we benchmark ten pipelines spanning single-model, ensemble, collaborative, and hybrid designs. Our results show that ensemble pipelines augmented with static analysis improve secure code generation over single-LLM baselines by up to 47.3% on SecLLMEval and 19.3% on SecLLMHolmes, while purely LLM-based collaborative pipelines yield smaller gains of 8.9% to 22.3%. Hybrid pipelines that integrate ensembling, detection, and patching achieve the strongest security performance, outperforming the best ensemble baseline by 1.78% to 4.72% and collaborative baselines by 19.81% to 26.78%. Ablation studies reveal that model scale alone does not ensure security: smaller, structured multi-model ensembles consistently outperform large monolithic LLMs. Overall, our findings demonstrate that secure code emerges not from scale but from carefully orchestrated multi-model system design.