The FormAI Dataset: Generative AI in Software Security Through the Lens of Formal Verification

This paper presents the FormAI dataset, a large collection of 112,000 AI-generated compilable and independent C programs with vulnerability classification. We introduce a dynamic zero-shot prompting technique, constructed to spawn a diverse set of programs utilizing Large Language Models (LLMs). The dataset is generated by GPT-3.5-turbo and comprises programs with varying levels of complexity. Some programs handle complicated tasks such as network management, table games, or encryption, while others deal with simpler tasks like string manipulation. Every program is labeled with the vulnerabilities found within the source code, indicating the type, line number, and vulnerable function name. This is accomplished by employing a formal verification method using the Efficient SMT-based Bounded Model Checker (ESBMC), which performs model checking, abstract interpretation, constraint programming, and satisfiability modulo theories, to reason over safety/security properties in programs. This approach definitively detects vulnerabilities and offers a formal model known as a counterexample, thus eliminating the possibility of generating false positive reports. This property of the dataset makes it suitable for evaluating the effectiveness of various static and dynamic analysis tools. Furthermore, we have associated the identified vulnerabilities with relevant Common Weakness Enumeration (CWE) numbers. We make the source code available for the 112,000 programs, accompanied by a comprehensive list detailing the vulnerabilities detected in each individual program including location and function name, which makes the dataset ideal to train LLMs and machine learning algorithms.

翻译：本文介绍了FormAI数据集——一个包含112,000个由AI生成、可编译且独立的C程序及其漏洞分类的大型集合。我们提出了一种动态零样本提示技术，旨在利用大型语言模型（LLMs）生成多样化的程序集合。该数据集由GPT-3.5-turbo生成，包含不同复杂程度的程序：部分程序处理网络管理、桌面游戏或加密等复杂任务，而其他程序则处理字符串操作等简单任务。每个程序均标注了源代码中发现的漏洞类型、所在行号和易受攻击函数名称。这一标注过程通过采用基于高效SMT有界模型检查器（ESBMC）的形式验证方法实现，该方法综合运用模型检测、抽象解释、约束规划与可满足性模理论，对程序的安全/防护属性进行推理。该方案可明确检测漏洞并输出名为反例的形式化模型，从而消除生成误报的可能性。数据集的这一特性使其适用于评估各类静态与动态分析工具的有效性。此外，我们将识别出的漏洞与相关通用缺陷枚举（CWE）编号相关联。我们公开发布了112,000个程序的源代码，并附有每份程序中检测到漏洞的详细清单（含位置与函数名称），使该数据集成为训练LLMs和机器学习算法的理想素材。

相关内容

数据集

关注 88

数据集，又称为资料集、数据集合或资料集合，是一种由数据所组成的集合。
Data set（或dataset）是一个数据的集合，通常以表格形式出现。每一列代表一个特定变量。每一行都对应于某一成员的数据集的问题。它列出的价值观为每一个变量，如身高和体重的一个物体或价值的随机数。每个数值被称为数据资料。对应于行数，该数据集的数据可能包括一个或多个成员。

O’Reilly报告：知识图谱崛起——面向现代数据集成和数据结构体系，“The Rise of the Knowledge Graph——Toward Modern Data Integration and the Data Fabric Architecture”

专知会员服务

49+阅读 · 2022年2月18日

Linux导论，Introduction to Linux，96页ppt

专知会员服务

82+阅读 · 2020年7月26日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日