This study compares state-of-the-art Large Language Models (LLMs) on their tendency to generate vulnerabilities when writing C programs from a neutral zero-shot prompt. Tihanyi et al. introduced the FormAI dataset at PROMISE'23, featuring 112,000 C programs generated by GPT-3.5-turbo, of which over 51.24% were identified as vulnerable. We extend that research with a large-scale study involving nine state-of-the-art models, including OpenAI's GPT-4o-mini, Google's Gemini Pro 1.0, TII's 180-billion-parameter Falcon, Meta's 13-billion-parameter Code Llama, and several other compact models. Additionally, we introduce the FormAI-v2 dataset, which comprises 331,000 compilable C programs generated by these LLMs. Each program in the dataset is labeled based on the vulnerabilities detected in its source code through formal verification, using the Efficient SMT-based Context-Bounded Model Checker (ESBMC). This technique minimizes false positives by providing a counterexample for each specific vulnerability and reduces false negatives by running the verification process to completion. Our study reveals that at least 62.07% of the generated programs are vulnerable. Differences between the models are minor: all exhibit similar coding errors, with slight variations. Our research highlights that while LLMs offer promising capabilities for code generation, deploying their output in a production environment requires proper risk assessment and validation.