The FormAI Dataset: Generative AI in Software Security Through the Lens of Formal Verification

This paper presents the FormAI dataset, a large collection of 112, 000 AI-generated compilable and independent C programs with vulnerability classification. We introduce a dynamic zero-shot prompting technique constructed to spawn diverse programs utilizing Large Language Models (LLMs). The dataset is generated by GPT-3.5-turbo and comprises programs with varying levels of complexity. Some programs handle complicated tasks like network management, table games, or encryption, while others deal with simpler tasks like string manipulation. Every program is labeled with the vulnerabilities found within the source code, indicating the type, line number, and vulnerable function name. This is accomplished by employing a formal verification method using the Efficient SMT-based Bounded Model Checker (ESBMC), which uses model checking, abstract interpretation, constraint programming, and satisfiability modulo theories to reason over safety/security properties in programs. This approach definitively detects vulnerabilities and offers a formal model known as a counterexample, thus eliminating the possibility of generating false positive reports. We have associated the identified vulnerabilities with Common Weakness Enumeration (CWE) numbers. We make the source code available for the 112, 000 programs, accompanied by a separate file containing the vulnerabilities detected in each program, making the dataset ideal for training LLMs and machine learning algorithms. Our study unveiled that according to ESBMC, 51.24% of the programs generated by GPT-3.5 contained vulnerabilities, thereby presenting considerable risks to software safety and security.

翻译：摘要：本文介绍了FormAI数据集，这是一个包含112,000个AI生成、可编译且独立的C程序的大型集合，并附有漏洞分类。我们提出了一种动态零样本提示技术，旨在利用大型语言模型生成多样化程序。该数据集由GPT-3.5-turbo生成，包含复杂度各异的程序：部分程序处理网络管理、桌面游戏或加密等复杂任务，其余则处理字符串操作等简单任务。每个程序均标注了源代码中发现的漏洞，包括类型、行号和易受攻击的函数名称。这是通过使用基于高效SMT有界模型检查器（ESBMC）的形式化验证方法实现的，该方法结合模型检验、抽象解释、约束规划及可满足性模理论，以推理程序的安全/安保属性。该技术可确定性地检测漏洞，并提供称为反例的形式化模型，从而消除生成误报的可能性。我们将识别出的漏洞与通用缺陷枚举（CWE）编号相关联。我们提供了112,000个程序的源代码，并附带包含每个程序中检测到的漏洞的独立文件，使该数据集非常适合训练大型语言模型和机器学习算法。我们的研究揭示，根据ESBMC的检测结果，GPT-3.5生成的程序中51.24%存在漏洞，这对软件安全与安保构成了重大风险。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

O’Reilly报告：知识图谱崛起——面向现代数据集成和数据结构体系，“The Rise of the Knowledge Graph——Toward Modern Data Integration and the Data Fabric Architecture”

专知会员服务

49+阅读 · 2022年2月18日

Linux导论，Introduction to Linux，96页ppt

专知会员服务

82+阅读 · 2020年7月26日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

35+阅读 · 2019年10月18日