Recent results of machine learning for automatic vulnerability detection (ML4VD) have been very promising. Given only the source code of a function $f$, ML4VD techniques can decide if $f$ contains a security flaw with up to 70% accuracy. However, as evident in our own experiments, the same top-performing models are unable to distinguish between functions that contain a vulnerability and functions where the vulnerability is patched. So, how can we explain this contradiction, and how can we improve the way we evaluate ML4VD techniques to get a better picture of their actual capabilities? In this paper, we identify overfitting to unrelated features and out-of-distribution generalization as two problems that are not captured by the traditional approach of evaluating ML4VD techniques. As a remedy, we propose a novel benchmarking methodology to help researchers better evaluate the true capabilities and limits of ML4VD techniques. Specifically, we propose (i) to augment the training and validation dataset according to our cross-validation algorithm, where a semantics-preserving transformation is applied during the augmentation of either the training set or the testing set, and (ii) to augment the testing set with code snippets where the vulnerabilities are patched. Using six ML4VD techniques and two datasets, we find (a) that state-of-the-art models severely overfit to unrelated features for predicting the vulnerabilities in the testing data, (b) that the performance gained by data augmentation does not generalize beyond the specific augmentations applied during training, and (c) that state-of-the-art ML4VD techniques are unable to distinguish vulnerable functions from their patches.
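To make the augmentation idea concrete, the following is a minimal sketch of one semantics-preserving transformation, consistent identifier renaming, applied to a C function before it is fed to a detector. The abstract does not specify which transformations the methodology uses, so this particular transformation and the helper `rename_identifiers` are illustrative assumptions; the key property is that the transformed function behaves identically, so a model that truly detects vulnerabilities should predict the same label for both versions.

```python
import re


def rename_identifiers(code: str, mapping: dict) -> str:
    """Rename identifiers in a code snippet according to `mapping`.

    Renaming local variables consistently does not change program
    behavior, so this is a semantics-preserving transformation: a
    robust ML4VD model should give the same prediction for the
    original and the transformed function.
    """
    # \b anchors ensure we match whole identifiers only, so renaming
    # "a" does not corrupt identifiers like "add" that contain "a".
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], code)


original = "int add(int a, int b) { return a + b; }"
transformed = rename_identifiers(original, {"a": "x", "b": "y"})
print(transformed)  # int add(int x, int y) { return x + y; }
```

Under the proposed cross-validation scheme, such a transformation would be applied to either the training set or the testing set (but not both), so that accuracy gains from memorizing surface features, rather than from recognizing the vulnerability itself, show up as a performance drop.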