Recent results in machine learning for automatic vulnerability detection have been very promising: given only the source code of a function $f$, models trained with machine learning techniques can decide whether $f$ contains a security flaw with up to 70% accuracy. But how do we know that these results are general and not specific to the datasets? To study this question, researchers proposed amplifying the testing set by injecting semantics-preserving changes and found that the model's accuracy drops significantly. In other words, the model uses some unrelated features during classification. To increase the robustness of the model, researchers proposed training on amplified training data, and model accuracy indeed recovered to previous levels. In this paper, we replicate and continue this investigation, and provide an actionable model benchmarking methodology to help researchers better evaluate advances in machine learning for vulnerability detection. Specifically, we propose (i) a cross-validation algorithm, in which a semantics-preserving transformation is applied during the amplification of either the training set or the testing set, and (ii) the amplification of the testing set with code snippets in which the vulnerabilities are fixed. Using 11 transformations, 3 ML techniques, and 2 datasets, we find that the improved robustness applies only to the specific transformations used during training data amplification. In other words, the robustified models still rely on unrelated features when predicting vulnerabilities in the testing data. Additionally, we find that the trained models are unable to generalize to the modified setting, which requires distinguishing vulnerable functions from their patches.
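The two proposals in the abstract can be illustrated with a minimal sketch. It is one possible reading, not the authors' implementation: it assumes that "cross-validation" means training on data amplified with one transformation and testing on data amplified with each transformation in turn. All names (`amplify`, `add_patches`, `cross_transformation_eval`, the toy transformations, and the memorizing stand-in model) are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Sample = Tuple[str, int]          # (function source, label); 1 = vulnerable, 0 = not
Transform = Callable[[str], str]  # assumed to preserve program semantics


def amplify(data: Sequence[Sample], transform: Transform) -> List[Sample]:
    """Return the original samples plus a transformed copy of each; labels are kept."""
    return list(data) + [(transform(code), label) for code, label in data]


def add_patches(test: Sequence[Sample], patched_sources: Sequence[str]) -> List[Sample]:
    """Extend a testing set with fixed (patched) functions, labeled non-vulnerable."""
    return list(test) + [(code, 0) for code in patched_sources]


def cross_transformation_eval(
    train: Sequence[Sample],
    test: Sequence[Sample],
    transforms: Dict[str, Transform],
    fit: Callable[[List[Sample]], object],
    accuracy: Callable[[object, List[Sample]], float],
) -> Dict[Tuple[str, str], float]:
    """Train with each transformation, then test against every transformation.

    Off-diagonal entries (train name != test name) indicate whether robustness
    carries over to transformations not seen during training.
    """
    results: Dict[Tuple[str, str], float] = {}
    for train_name, t_train in transforms.items():
        model = fit(amplify(train, t_train))
        for test_name, t_test in transforms.items():
            results[(train_name, test_name)] = accuracy(model, amplify(test, t_test))
    return results


# --- Toy stand-ins (assumptions, not a real vulnerability detector) ---
def fit_memorizer(data: List[Sample]) -> Dict[str, int]:
    """A 'model' that memorizes exact source strings."""
    return dict(data)


def accuracy_memorizer(model: Dict[str, int], data: List[Sample]) -> float:
    """Predict the memorized label, or 0 for unseen code; return the hit rate."""
    hits = sum(model.get(code, 0) == label for code, label in data)
    return hits / len(data)


# Labels here are made up for illustration only.
toy_data: List[Sample] = [("strcpy(buf, s);", 1), ("strncpy(buf, s, n);", 0)]
toy_transforms: Dict[str, Transform] = {
    "pad": lambda code: code + " /* pad */",          # append a comment
    "rename": lambda code: code.replace("buf", "tmp"),  # rename an identifier
}
toy_results = cross_transformation_eval(
    toy_data, toy_data, toy_transforms, fit_memorizer, accuracy_memorizer
)
```

In this toy setup the memorizing stand-in scores perfectly when the training and testing transformations match and worse when they differ, which mirrors the kind of gap the cross-validation setup is meant to expose. A real study would replace `fit_memorizer` and `accuracy_memorizer` with an actual training and evaluation pipeline.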