Software defect datasets, which are collections of software bugs and their associated information, are essential resources for researchers and practitioners in software engineering and beyond. Such datasets facilitate empirical research and enable standardized benchmarking for a wide range of techniques, including fault detection, fault localization, test generation, test prioritization, automated program repair, and emerging areas such as agentic AI-based software development. Over the years, numerous software defect datasets with diverse characteristics have been developed, providing rich resources for the community but making the landscape increasingly difficult to navigate. To address this challenge, this article provides a comprehensive survey of 151 software defect datasets. The survey discusses the scope of existing datasets, e.g., the application domain of the buggy software, the types of defects, and the programming languages used. We also examine how these datasets are constructed, including the data sources and construction methods employed. Furthermore, we assess the availability and usability of the datasets by verifying whether they are accessible and examining how defects are presented. To better understand the practical uses of these datasets, we analyze the publications that cite them, revealing that the primary use cases are evaluations of new techniques and empirical research. Based on our review of the existing datasets, this article suggests opportunities for future research, including addressing underrepresented kinds of defects, enhancing availability and usability through better dataset organization, and developing more efficient strategies for dataset construction and maintenance. All surveyed datasets and their classifications are available at https://defect-datasets.github.io/.