We constructed MegaVul, a new large-scale and comprehensive C/C++ vulnerability dataset, by crawling the Common Vulnerabilities and Exposures (CVE) database and CVE-related open-source projects. Specifically, we collected all crawlable descriptive information about the vulnerabilities from the CVE database and extracted all vulnerability-related code changes from 28 Git-based websites. We adopted advanced tools to ensure the integrity of the extracted code and enriched the code with four different transformed representations. In total, MegaVul contains 17,380 vulnerabilities collected from 992 open-source repositories, spanning 169 different vulnerability types disclosed from January 2006 to October 2023. MegaVul can therefore be used for a variety of software security tasks, including vulnerability detection and vulnerability severity assessment. All information is stored in JSON format for ease of use. MegaVul is publicly available on GitHub and will be continuously updated. The dataset can be easily extended to other programming languages.