REEF: A Framework for Collecting Real-World Vulnerabilities and Fixes

Software plays a crucial role in daily life, so the quality and security of software systems have become increasingly important. Vulnerabilities nevertheless remain a significant threat and can have serious consequences. Recent work on automated program repair seeks to detect and fix bugs automatically using data-driven techniques, and sophisticated deep learning methods applied to this area have achieved promising results. However, existing benchmarks for training and evaluating these techniques remain limited: they tend to cover a single programming language and contain relatively few samples. Many are also outdated and lack diversity because they draw on a specific codebase. Worse still, the quality of bug explanations in existing datasets is low, since they typically reuse imprecise and uninformative commit messages as explanations. To address these issues, we propose REEF, an automated framework for collecting REal-world vulnErabilities and Fixes from open-source repositories. We develop a multi-language crawler to collect vulnerabilities and their fixes, and design metrics to filter for high-quality vulnerability-fix pairs. Furthermore, we propose a neural language model-based approach to generate high-quality vulnerability explanations, which is key to producing informative fix messages. Through extensive experiments, we demonstrate that our approach collects high-quality vulnerability-fix pairs and generates strong explanations. The resulting dataset contains 4,466 CVEs with 30,987 patches (covering 236 CWE categories) across 7 programming languages, together with detailed related information, and surpasses existing benchmarks in scale, coverage, and quality. Evaluations by human experts further confirm that our framework produces high-quality vulnerability explanations.
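The abstract does not say which metrics REEF uses to filter vulnerability-fix pairs. As a rough illustration of what metric-based filtering could look like, the sketch below keeps only patches whose size falls within caller-supplied bounds. The record fields, the function names, the default thresholds, and the placeholder CVE identifiers are all assumptions made for this example and are not taken from REEF.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class FixPair:
    """Hypothetical record for one vulnerability-fix pair (not REEF's schema)."""
    cve_id: str
    language: str
    files_changed: int
    lines_changed: int


def is_candidate(pair: FixPair, max_files: int = 5, max_lines: int = 200) -> bool:
    """Return True if the patch is non-empty and small enough to review.

    The thresholds are illustrative defaults, not values from the paper.
    """
    if not 1 <= pair.files_changed <= max_files:
        return False
    if not 1 <= pair.lines_changed <= max_lines:
        return False
    return True


def filter_pairs(pairs: Iterable[FixPair], max_files: int = 5,
                 max_lines: int = 200) -> List[FixPair]:
    """Keep the pairs that pass the size-based check, preserving input order."""
    return [p for p in pairs if is_candidate(p, max_files, max_lines)]


if __name__ == "__main__":
    # Placeholder identifiers; these are not real CVE entries.
    sample = [
        FixPair("CVE-0000-0001", "c", files_changed=1, lines_changed=12),
        FixPair("CVE-0000-0002", "go", files_changed=40, lines_changed=3000),
        FixPair("CVE-0000-0003", "java", files_changed=0, lines_changed=0),
    ]
    for pair in filter_pairs(sample):
        print(pair.cve_id, pair.language)
```

Patch size is only one plausible signal. A real pipeline would likely combine several checks, and the bounds would need tuning against manually inspected samples.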

