From Bugs to Benchmarks: A Comprehensive Survey of Software Defect Datasets

Software defect datasets, which are collections of software bugs and their associated information, are essential resources for researchers and practitioners in software engineering and beyond. Such datasets facilitate empirical research and enable standardized benchmarking for a wide range of techniques, including fault detection, fault localization, test generation, test prioritization, automated program repair, and emerging areas like agentic AI-based software development. Over the years, numerous software defect datasets with diverse characteristics have been developed, providing rich resources for the community, yet making it increasingly difficult to navigate the landscape. To address this challenge, this article provides a comprehensive survey of 151 software defect datasets. The survey discusses the scope of existing datasets, e.g., regarding the application domain of the buggy software, the types of defects, and the programming languages used. We also examine the construction of these datasets, including the data sources and construction methods employed. Furthermore, we assess the availability and usability of the datasets, validating their availability and examining how defects are presented. To better understand the practical uses of these datasets, we analyze the publications that cite them, revealing that the primary use cases are evaluations of new techniques and empirical research. Based on our comprehensive review of the existing datasets, this paper suggests potential opportunities for future research, including addressing underrepresented kinds of defects, enhancing availability and usability through better dataset organization, and developing more efficient strategies for dataset construction and maintenance. All surveyed datasets and their classifications are available at https://defect-datasets.github.io/.

