Software plays a crucial role in daily life, so the quality and security of software systems have become increasingly important. Software vulnerabilities nevertheless remain a significant threat and can have serious consequences. Recent work in automated program repair seeks to detect and fix bugs automatically using data-driven techniques, and deep learning methods applied to this area have achieved promising results. However, existing benchmarks for training and evaluating these techniques remain limited: they tend to focus on a single programming language and are relatively small. Moreover, many benchmarks are outdated and lack diversity, drawing on a specific codebase. Worse still, the quality of bug explanations in existing datasets is low, since they typically use imprecise and uninformative commit messages as explanations. To address these issues, we propose REEF, an automated framework for collecting REal-world vulnErabilities and Fixes from open-source repositories. We develop a multi-language crawler to collect vulnerabilities and their fixes, and design metrics to filter for high-quality vulnerability-fix pairs. Furthermore, we propose a neural language model-based approach to generate high-quality vulnerability explanations, which is key to producing informative fix messages. Through extensive experiments, we demonstrate that our approach can collect high-quality vulnerability-fix pairs and generate strong explanations. The resulting dataset contains 4,466 CVEs with 30,987 patches (covering 236 CWE categories) across 7 programming languages, together with detailed related information, and is superior to existing benchmarks in scale, coverage, and quality. Evaluations by human experts further confirm that our framework produces high-quality vulnerability explanations.