This paper presents the FormAI dataset, a large collection of 112,000 AI-generated compilable and independent C programs with vulnerability classification. We introduce a dynamic zero-shot prompting technique, constructed to spawn a diverse set of programs utilizing Large Language Models (LLMs). The dataset is generated by GPT-3.5-turbo and comprises programs with varying levels of complexity. Some programs handle complicated tasks such as network management, table games, or encryption, while others deal with simpler tasks like string manipulation. Every program is labeled with the vulnerabilities found within the source code, indicating the type, line number, and vulnerable function name. This is accomplished by employing a formal verification method using the Efficient SMT-based Bounded Model Checker (ESBMC), which performs model checking, abstract interpretation, constraint programming, and satisfiability modulo theories, to reason over safety/security properties in programs. This approach definitively detects vulnerabilities and offers a formal model known as a counterexample, thus eliminating the possibility of generating false positive reports. This property of the dataset makes it suitable for evaluating the effectiveness of various static and dynamic analysis tools. Furthermore, we have associated the identified vulnerabilities with relevant Common Weakness Enumeration (CWE) numbers. We make the source code available for the 112,000 programs, accompanied by a comprehensive list detailing the vulnerabilities detected in each individual program including location and function name, which makes the dataset ideal to train LLMs and machine learning algorithms.
翻译:本文介绍了FormAI数据集——一个包含112,000个由AI生成、可编译且独立的C程序及其漏洞分类的大型集合。我们提出了一种动态零样本提示技术,旨在利用大型语言模型(LLMs)生成多样化的程序集合。该数据集由GPT-3.5-turbo生成,包含不同复杂程度的程序:部分程序处理网络管理、桌面游戏或加密等复杂任务,而其他程序则处理字符串操作等简单任务。每个程序均标注了源代码中发现的漏洞类型、所在行号和易受攻击函数名称。这一标注过程通过采用基于高效SMT有界模型检查器(ESBMC)的形式验证方法实现,该方法综合运用模型检测、抽象解释、约束规划与可满足性模理论,对程序的安全/防护属性进行推理。该方案可明确检测漏洞并输出名为反例的形式化模型,从而消除生成误报的可能性。数据集的这一特性使其适用于评估各类静态与动态分析工具的有效性。此外,我们将识别出的漏洞与相关通用缺陷枚举(CWE)编号相关联。我们公开发布了112,000个程序的源代码,并附有每份程序中检测到漏洞的详细清单(含位置与函数名称),使该数据集成为训练LLMs和机器学习算法的理想素材。