This study compares state-of-the-art Large Language Models (LLMs) on their tendency to generate vulnerabilities when writing C programs from a neutral zero-shot prompt. Tihanyi et al. introduced the FormAI dataset at PROMISE'23, featuring 112,000 C programs generated by GPT-3.5-turbo, of which over 51.24% were identified as vulnerable. We extend that research with a large-scale study involving nine state-of-the-art models, including OpenAI's GPT-4o-mini, Google's Gemini Pro 1.0, TII's 180-billion-parameter Falcon, Meta's 13-billion-parameter Code Llama, and several other compact models. Additionally, we introduce the FormAI-v2 dataset, which comprises 331,000 compilable C programs generated by these LLMs. Each program in the dataset is labeled based on the vulnerabilities detected in its source code through formal verification, using the Efficient SMT-based Context-Bounded Model Checker (ESBMC). This technique minimizes false positives by providing a counterexample for each specific vulnerability and reduces false negatives by running the verification process to completion. Our study reveals that at least 62.07% of the generated programs are vulnerable. Differences between the models are minor: all exhibit similar coding errors, with slight variations. Our research highlights that while LLMs offer promising capabilities for code generation, deploying their output in a production environment requires proper risk assessment and validation.