ActiveClean: Generating Line-Level Vulnerability Data via Active Learning

Deep learning vulnerability detection tools are increasing in popularity and have been shown to be effective. These tools rely on large volumes of high-quality training data, which are very hard to obtain. Most currently available datasets provide only function-level labels, reporting whether a function is vulnerable or not. However, for vulnerability detection to be useful, we also need to know which lines are relevant to the vulnerability. This paper works towards systematic tooling for this problem and proposes ActiveClean, which generates large volumes of line-level vulnerability data from commits. That is, in addition to function-level labels, it also reports which lines in the function are likely responsible for the vulnerability. In the past, static analysis has been applied to clean commits and generate line-level data. Our approach, based on active learning, is easy to use and scalable, and provides a complement to static analysis. We designed semantic and syntactic properties of commit lines and used them to train the model. We evaluated our approach on both Java and C datasets, processing more than 4.3K commits and 119K commit lines. ActiveClean achieved an F1 score between 70 and 74. Further, we show that active learning is effective: using just 400 training samples, it reaches an F1 score of 70.23. Using ActiveClean, we generated line-level labels for the entire FFMpeg project in the Devign dataset, comprising 5K functions, and also detected incorrect function-level labels. We demonstrated that, when trained on our cleaned data, LineVul, a SOTA line-level vulnerability detection tool, detected 70 more vulnerable lines and 18 more vulnerable functions, and improved Top-10 accuracy from 66% to 73%.
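
To make the active-learning idea concrete, the following is a minimal sketch of a pool-based loop, not ActiveClean's implementation. The abstract does not state the query strategy, model, or features, so everything here is an assumption made for illustration: uncertainty sampling as the query strategy, a nearest-centroid classifier as the model, a single hypothetical numeric feature per commit line, and an `oracle` callback standing in for the human annotator.

```python
def train_centroids(labeled):
    """Fit a nearest-centroid model: one mean feature vector per class (0 and 1)."""
    sums, counts = {}, {}
    for features, label in labeled:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def margin(centroids, features):
    """Positive when the line is closer to the class-1 centroid; near zero means uncertain."""
    return sq_dist(features, centroids[0]) - sq_dist(features, centroids[1])


def predict(centroids, features):
    return 1 if margin(centroids, features) > 0 else 0


def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def active_learning(pool, oracle, seed_indices, budget, batch_size):
    """Label at most `budget` pool items, querying the most uncertain ones first.

    Assumes the seed set contains at least one example of each class.
    """
    labeled = {i: oracle(i) for i in seed_indices}
    while len(labeled) < budget:
        centroids = train_centroids([(pool[i], y) for i, y in labeled.items()])
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        if not unlabeled:
            break
        # Uncertainty sampling: smallest absolute margin first.
        unlabeled.sort(key=lambda i: abs(margin(centroids, pool[i])))
        for i in unlabeled[: min(batch_size, budget - len(labeled))]:
            labeled[i] = oracle(i)
    centroids = train_centroids([(pool[i], y) for i, y in labeled.items()])
    return centroids, labeled


# Toy pool: 20 commit lines with one hypothetical feature each.
# Lines with a feature value of 10 or more are "vulnerability-relevant" (label 1).
pool = [[float(v)] for v in range(20)]
truth = [0] * 10 + [1] * 10

model, labeled = active_learning(
    pool, oracle=lambda i: truth[i], seed_indices=[0, 19], budget=6, batch_size=2
)
score = f1_score(truth, [predict(model, x) for x in pool])
print(sorted(labeled), score)
```

In this toy run the loop spends its labeling budget on the lines closest to the decision boundary rather than on lines the model already classifies confidently, which is the intuition behind reaching a usable F1 score with a small labeled set.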

