Limits of Machine Learning for Automatic Vulnerability Detection

Recent results of machine learning for automatic vulnerability detection have been very promising: given only the source code of a function $f$, models trained by machine learning techniques can decide whether $f$ contains a security flaw with up to 70% accuracy. But how do we know that these results are general and not specific to the datasets? To study this question, researchers proposed amplifying the testing set by injecting semantics-preserving changes and found that the model's accuracy drops significantly. In other words, the model relies on features unrelated to the vulnerability during classification. To increase the robustness of the model, researchers proposed training on amplified training data, and model accuracy indeed returned to previous levels. In this paper, we replicate and continue this investigation, and provide an actionable model benchmarking methodology to help researchers better evaluate advances in machine learning for vulnerability detection. Specifically, we propose (i) a cross-validation algorithm, in which a semantics-preserving transformation is applied during the amplification of either the training set or the testing set, and (ii) the amplification of the testing set with code snippets in which the vulnerabilities are fixed. Using 11 transformations, 3 ML techniques, and 2 datasets, we find that the improved robustness applies only to the specific transformations used during training data amplification. In other words, the robustified models still rely on unrelated features when predicting the vulnerabilities in the testing data. Additionally, we find that the trained models are unable to generalize to the modified setting, which requires distinguishing vulnerable functions from their patches.
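The cross-validation idea in (i) can be illustrated with a minimal Python sketch. This is not the paper's algorithm or code: the helper names (`amplify`, `cross_transformation_eval`, `fit_token_model`), the two toy transformations, and the token-based stand-in classifier are all assumptions made for illustration. The sketch trains one model per training-side transformation and scores it on the test set under every test-side transformation, so that a robustness gain limited to the transformation seen during training would show up as high accuracy only where the two transformations match.

```python
from typing import Callable, Dict, List, Tuple

Sample = Tuple[str, int]            # (function source, label: 1 = vulnerable)
Transform = Callable[[str], str]    # assumed to preserve program semantics
Model = Callable[[str], int]


def amplify(data: List[Sample], transform: Transform) -> List[Sample]:
    """Return the original samples plus a transformed copy of each one."""
    return list(data) + [(transform(src), label) for src, label in data]


def accuracy(model: Model, data: List[Sample]) -> float:
    """Fraction of samples whose predicted label matches the true label."""
    return sum(model(src) == label for src, label in data) / len(data)


def cross_transformation_eval(
    train: List[Sample],
    test: List[Sample],
    transforms: Dict[str, Transform],
    fit: Callable[[List[Sample]], Model],
) -> Dict[Tuple[str, str], float]:
    """Train with one transformation, test with every transformation."""
    results = {}
    for train_name, t_train in transforms.items():
        model = fit(amplify(train, t_train))
        for test_name, t_test in transforms.items():
            transformed_test = [(t_test(src), label) for src, label in test]
            results[(train_name, test_name)] = accuracy(model, transformed_test)
    return results


def fit_token_model(data: List[Sample]) -> Model:
    """Toy stand-in classifier: flag tokens seen only in vulnerable samples."""
    vulnerable_tokens, safe_tokens = set(), set()
    for src, label in data:
        (vulnerable_tokens if label else safe_tokens).update(src.split())
    signal = vulnerable_tokens - safe_tokens
    return lambda src: int(any(tok in signal for tok in src.split()))


# Hypothetical transformations and data, for illustration only.
TRANSFORMS: Dict[str, Transform] = {
    "add_comment": lambda src: "/* c */ " + src,
    "rename_buffer": lambda src: src.replace("buf", "data"),
}
TRAIN: List[Sample] = [
    ("strcpy ( buf , src ) ;", 1),
    ("strncpy ( buf , src , n ) ;", 0),
]
TEST: List[Sample] = [
    ("strcpy ( dst , s ) ;", 1),
    ("strncpy ( dst , s , n ) ;", 0),
]
```

Calling `cross_transformation_eval(TRAIN, TEST, TRANSFORMS, fit_token_model)` returns one accuracy per (training transformation, testing transformation) pair. The evaluation in (ii) could be approximated in the same framework by adding the patched version of each vulnerable test function to the test set with label 0.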

