Defect prediction is crucial for software quality assurance and has been extensively researched over recent decades. However, prior studies rarely focus on data complexity in defect prediction tasks, and even fewer try to understand the difficulty of these tasks from the perspective of data complexity. In this paper, we conduct an empirical study that estimates the hardness of over 33,000 instances, employing a set of measures to characterize both the inherent difficulty of individual instances and the characteristics of defect datasets. Our findings indicate that: (1) instance hardness in both classes follows a right-skewed distribution, with the defective class exhibiting a more scattered distribution; (2) class overlap is the primary factor influencing instance hardness and can be characterized through feature, structural, instance, and multiresolution overlap; (3) no universal preprocessing technique is applicable to all datasets, and preprocessing does not consistently reduce data complexity; however, dataset complexity measures can help identify suitable techniques for specific datasets; (4) integrating data complexity information into the learning process can enhance an algorithm's learning capacity. In summary, this empirical study highlights the crucial role of data complexity in defect prediction tasks and provides a novel perspective for advancing research in defect prediction techniques.
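To make the notion of instance hardness concrete, the following is a minimal sketch of one measure from the instance hardness literature, k-Disagreeing Neighbors (kDN): the fraction of an instance's k nearest neighbors that carry a different class label. The abstract does not name the measures used in the study, so the choice of kDN is an assumption made for illustration, as are the function name `kdn_hardness`, the value `k=3`, and the toy data.

```python
import math


def kdn_hardness(X, y, k=3):
    """Return, for each instance, the fraction of its k nearest
    neighbors (Euclidean distance) whose label differs from its own.

    Illustrative only: kDN is one possible hardness measure, not
    necessarily one used in the study.
    """
    scores = []
    for i, xi in enumerate(X):
        # Distances from instance i to every other instance.
        dists = sorted(
            (math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i
        )
        neighbors = [j for _, j in dists[:k]]
        disagree = sum(1 for j in neighbors if y[j] != y[i])
        scores.append(disagree / len(neighbors))
    return scores


# Hypothetical toy data: two software metrics per module,
# label 1 = defective, label 0 = clean.
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (1.1, 1.0), (0.05, 0.05)]
y = [0, 0, 0, 1, 1, 1]

hardness = kdn_hardness(X, y, k=3)
for features, label, h in zip(X, y, hardness):
    print(features, label, round(h, 2))
```

In this toy data the last module is labeled defective but sits inside the cluster of clean modules, so all of its neighbors disagree with it and it receives the highest score. This is the kind of class overlap that the abstract identifies as the primary factor influencing instance hardness.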