Automated Vulnerability Detection in Source Code Using Deep Representation Learning

Software vulnerabilities discovered each year pose significant risks of exploitation and system compromise. We present a convolutional neural network model that can successfully identify bugs in C code. We trained our model using two complementary datasets: a machine-labeled dataset created by Draper Labs using three static analyzers, and the human-labeled NIST SATE Juliet dataset, which was designed for testing static analyzers. In contrast with the work of Russell et al. on these datasets, we focus on C programs, which lets us specialize and optimize our detection techniques for this language. After removing duplicates from the dataset, we tokenize the input into 91 token categories. Each category value is converted to a binary vector to save memory. The first convolution layer is chosen so that the entire encoding of a token is presented to the filter. We use two convolution and pooling layers followed by two fully connected layers to classify each program either into a Common Weakness Enumeration (CWE) category or as "clean." When high precision is required, we obtain higher recall than the prior work by Russell et al. on this dataset. We also demonstrate on a custom Linux kernel dataset that we are able to find real vulnerabilities in complex code with a low false-positive rate.
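The encoding and first-layer design described in the abstract can be illustrated with a minimal sketch. It assumes that "binary vector" means a fixed-width binary code of the category index (7 bits are enough for 91 categories) and that the first filter spans all bits of each token while sliding along the token axis. The bit order, the filter length of 3 tokens, the category IDs, and all function names are hypothetical choices for illustration; the abstract does not specify them.

```python
NUM_CATEGORIES = 91  # number of token categories stated in the abstract
BITS = (NUM_CATEGORIES - 1).bit_length()  # 7 bits cover indices 0..90


def encode_category(cat_id):
    """Return the category index as a fixed-width list of bits (MSB first).

    Assumption: a plain binary code; the paper may use a different layout.
    """
    if not 0 <= cat_id < NUM_CATEGORIES:
        raise ValueError(f"category id out of range: {cat_id}")
    return [(cat_id >> shift) & 1 for shift in range(BITS - 1, -1, -1)]


def encode_sequence(cat_ids):
    """Encode a token-category sequence as a (length x BITS) bit matrix."""
    return [encode_category(c) for c in cat_ids]


def conv_full_width(encoded, filt):
    """Slide a (k x BITS) filter along the token axis only.

    Because the filter is as wide as the encoding, every output value
    sees the complete bit pattern of each of the k tokens in its window.
    """
    k = len(filt)
    outputs = []
    for start in range(len(encoded) - k + 1):
        total = 0.0
        for i in range(k):
            for j in range(BITS):
                total += encoded[start + i][j] * filt[i][j]
        outputs.append(total)
    return outputs


# Hypothetical category IDs for a five-token snippet.
tokens = [1, 2, 3, 4, 5]
encoded = encode_sequence(tokens)

# An all-ones filter over 3 tokens simply counts set bits per window.
ones_filter = [[1.0] * BITS for _ in range(3)]
features = conv_full_width(encoded, ones_filter)
```

Compared with a one-hot encoding of 91 entries per token, the 7-bit code is roughly 13 times smaller, which is one plausible reading of the memory saving the abstract mentions. In a real model the filter weights would be learned, and this layer would feed the pooling and fully connected layers.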

