Do Neutral Prompts Produce Insecure Code? FormAI-v2 Dataset: Labelling Vulnerabilities in Code Generated by Large Language Models

This study presents a comparative analysis of state-of-the-art large language models (LLMs), examining how likely they are to introduce vulnerabilities when writing simple C programs from a neutral zero-shot prompt. We address a significant gap in the literature concerning the security properties of code these models produce when given no security-specific directives. N. Tihanyi et al. introduced the FormAI dataset at PROMISE '23, containing 112,000 GPT-3.5-generated C programs, over 51.24% of which were identified as vulnerable. We expand that work by introducing the FormAI-v2 dataset, comprising 265,000 compilable C programs generated by a range of LLMs, from large models such as Google's GEMINI-pro, OpenAI's GPT-4, and TII's 180 billion-parameter Falcon, to Meta's specialized 13 billion-parameter CodeLLama2 and various other compact models. Each program in the dataset is labelled according to the vulnerabilities detected in its source code through formal verification with the Efficient SMT-based Context-Bounded Model Checker (ESBMC). This technique eliminates false positives by delivering a counterexample for each reported violation, and it excludes false negatives when the verification process runs to completion. Our study reveals that at least 63.47% of the generated programs are vulnerable. The differences between the models are minor, as they all display similar coding errors with slight variations. Our research highlights that while LLMs offer promising capabilities for code generation, deploying their output in a production environment requires risk assessment and validation.
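To make the labelling idea concrete, the following is a minimal sketch of the kind of defect a bounded model checker is designed to catch. It is not taken from the FormAI-v2 dataset; the names `BUF_LEN`, `store_unchecked`, and `store_checked` are hypothetical and chosen only for illustration. For the unchecked variant, a tool such as ESBMC would be expected to report an out-of-bounds array write together with a counterexample (for instance, an index equal to `BUF_LEN`), whereas the checked variant rejects that input.

```c
#include <stddef.h>

#define BUF_LEN 8 /* hypothetical buffer size, for illustration only */

/* Vulnerable pattern: idx is never validated, so any caller that passes
 * idx >= BUF_LEN causes a write past the end of the array. */
void store_unchecked(int buf[BUF_LEN], size_t idx, int value) {
    buf[idx] = value;
}

/* Corrected pattern: the bounds check makes the out-of-range write
 * unreachable. Returns 0 on success and -1 if idx is out of range. */
int store_checked(int buf[BUF_LEN], size_t idx, int value) {
    if (idx >= BUF_LEN) {
        return -1;
    }
    buf[idx] = value;
    return 0;
}
```

Calling `store_checked(buf, 8, 1)` on an 8-element buffer returns -1 and leaves the buffer untouched, while the same call to `store_unchecked` has undefined behavior. A concrete input that triggers the violation is what allows a reported vulnerability to be confirmed rather than merely suspected.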
