Defect prediction has been a popular research topic in which machine learning (ML) and deep learning (DL) have found numerous applications. However, ML/DL-based defect prediction models are often limited by the quality and size of their datasets. In this paper, we present Defectors, a large dataset for just-in-time and line-level defect prediction. Defectors consists of $\approx$ 213K source code files ($\approx$ 93K defective and $\approx$ 120K defect-free) that span 24 popular Python projects. These projects come from 18 different domains, including machine learning, automation, and internet-of-things. This scale and diversity make Defectors a suitable dataset for training ML/DL models, especially transformer models, which require large and diverse datasets. We also foresee several application areas for our dataset, including defect prediction and defect explanation. Dataset link: https://doi.org/10.5281/zenodo.7708984