This paper presents the FormAI dataset, a large collection of 112, 000 AI-generated compilable and independent C programs with vulnerability classification. We introduce a dynamic zero-shot prompting technique constructed to spawn diverse programs utilizing Large Language Models (LLMs). The dataset is generated by GPT-3.5-turbo and comprises programs with varying levels of complexity. Some programs handle complicated tasks like network management, table games, or encryption, while others deal with simpler tasks like string manipulation. Every program is labeled with the vulnerabilities found within the source code, indicating the type, line number, and vulnerable function name. This is accomplished by employing a formal verification method using the Efficient SMT-based Bounded Model Checker (ESBMC), which uses model checking, abstract interpretation, constraint programming, and satisfiability modulo theories to reason over safety/security properties in programs. This approach definitively detects vulnerabilities and offers a formal model known as a counterexample, thus eliminating the possibility of generating false positive reports. We have associated the identified vulnerabilities with Common Weakness Enumeration (CWE) numbers. We make the source code available for the 112, 000 programs, accompanied by a separate file containing the vulnerabilities detected in each program, making the dataset ideal for training LLMs and machine learning algorithms. Our study unveiled that according to ESBMC, 51.24% of the programs generated by GPT-3.5 contained vulnerabilities, thereby presenting considerable risks to software safety and security.
翻译:摘要:本文介绍了FormAI数据集,这是一个包含112,000个AI生成、可编译且独立的C程序的大型集合,并附有漏洞分类。我们提出了一种动态零样本提示技术,旨在利用大型语言模型生成多样化程序。该数据集由GPT-3.5-turbo生成,包含复杂度各异的程序:部分程序处理网络管理、桌面游戏或加密等复杂任务,其余则处理字符串操作等简单任务。每个程序均标注了源代码中发现的漏洞,包括类型、行号和易受攻击的函数名称。这是通过使用基于高效SMT有界模型检查器(ESBMC)的形式化验证方法实现的,该方法结合模型检验、抽象解释、约束规划及可满足性模理论,以推理程序的安全/安保属性。该技术可确定性地检测漏洞,并提供称为反例的形式化模型,从而消除生成误报的可能性。我们将识别出的漏洞与通用缺陷枚举(CWE)编号相关联。我们提供了112,000个程序的源代码,并附带包含每个程序中检测到的漏洞的独立文件,使该数据集非常适合训练大型语言模型和机器学习算法。我们的研究揭示,根据ESBMC的检测结果,GPT-3.5生成的程序中51.24%存在漏洞,这对软件安全与安保构成了重大风险。