Software vulnerabilities discovered each year pose significant risks of exploitation and system compromise. We present a convolutional neural network model that identifies bugs in C code. We trained our model on two complementary datasets: a machine-labeled dataset created by Draper Labs using three static analyzers, and the human-labeled NIST SATE Juliet dataset, which was designed for testing static analyzers. In contrast with the work of Russell et al. on these datasets, we focus on C programs, which lets us specialize and optimize our detection techniques for this language. After removing duplicates from the dataset, we tokenize the input into 91 token categories. Each category value is converted to a binary vector to save memory. The first convolution layer is chosen so that the entire encoding of a token is presented to the filter. The network uses two convolution and pooling layers followed by two fully connected layers, and classifies each program either into a Common Weakness Enumeration (CWE) category or as ``clean.'' When high precision is required, we obtain higher recall than the prior work of Russell et al. on this dataset. We also demonstrate on a custom Linux kernel dataset that the model finds real vulnerabilities in complex code with a low false-positive rate.
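The token encoding and the first-layer filter described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions not stated in the abstract: that "binary vector" means a fixed-width binary code of the category ID (7 bits are enough for 91 categories, versus 91 entries for a one-hot vector), that bits are ordered most significant first, and that the filter spans a small number of whole tokens. The names `encode_token`, `encode_sequence`, and `conv1d_full_token` are hypothetical, and the plain-Python convolution stands in for whatever framework the authors used.

```python
NUM_CATEGORIES = 91                           # stated in the abstract
BITS = (NUM_CATEGORIES - 1).bit_length()      # assumption: compact binary code, 7 bits


def encode_token(category):
    """Return a category ID as a fixed-width binary vector (MSB first, an assumption)."""
    if not 0 <= category < NUM_CATEGORIES:
        raise ValueError("category out of range: %r" % (category,))
    return [(category >> shift) & 1 for shift in range(BITS - 1, -1, -1)]


def encode_sequence(categories):
    """Encode a tokenized program as a list of BITS-wide binary vectors."""
    return [encode_token(c) for c in categories]


def conv1d_full_token(encoded, kernel, bias=0.0):
    """Slide a (k x BITS) kernel over the token axis with stride one token.

    Because the kernel covers all BITS columns, every filter position sees
    the complete encoding of each token in its window, never a partial one.
    """
    k = len(kernel)
    outputs = []
    for start in range(len(encoded) - k + 1):
        acc = bias
        for i in range(k):
            for j in range(BITS):
                acc += kernel[i][j] * encoded[start + i][j]
        outputs.append(acc)
    return outputs


if __name__ == "__main__":
    tokens = [5, 90, 0]                       # hypothetical category IDs
    encoded = encode_sequence(tokens)
    ones_kernel = [[1.0] * BITS for _ in range(2)]   # toy filter spanning 2 tokens
    print(encoded)
    print(conv1d_full_token(encoded, ones_kernel))
```

With the all-ones toy filter, each output simply counts the set bits in a two-token window. A trained filter would instead learn weights over the same window shape.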