Toward a consistent performance evaluation for defect prediction models

In the defect prediction community, many defect prediction models have been proposed, and new models continue to be developed. However, there is no consensus on how to evaluate the performance of a newly proposed model. In this paper, we propose MATTER, a fraMework towArd a consisTenT pErformance compaRison, which makes model performance directly comparable across different studies. We take three actions to build a consistent evaluation framework for defect prediction models. First, we propose a simple and easy-to-use unsupervised baseline model, ONE (glObal baseliNe modEl), to provide "a single point of comparison". Second, we propose using an SQA-effort-aligned threshold setting to make comparisons fair. Third, we suggest reporting evaluation results in a unified way and provide a set of core performance indicators for this purpose, thus enabling across-study comparisons that reveal real progress. The experimental results show that MATTER is an effective framework for consistent performance evaluation of defect prediction models; it can therefore help determine whether a newly proposed model is useful to practitioners and indicate the real progress made in defect prediction. Furthermore, when applying MATTER to representative defect prediction models proposed in recent years, we find that most of them (if not all) are not superior to the simple baseline model ONE in terms of SQA-effort-aware prediction performance. This reveals that the real progress in defect prediction has been overestimated. We therefore recommend that, in future studies, any newly proposed defect prediction model be evaluated with MATTER (on the same benchmark test data sets) to assess its actual usefulness and advance scientific progress in defect prediction.
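The idea of an effort-aligned threshold can be illustrated with a minimal sketch. The abstract does not specify how MATTER sets the threshold or how ONE scores modules, so everything below is an assumption made for illustration: the function names, the use of module size (lines of code) as a proxy for SQA effort, the 20% default budget, and the size-based scorer standing in for an unsupervised baseline. It is not the paper's ONE or its threshold procedure.

```python
from typing import List, Sequence


def effort_aligned_labels(
    scores: Sequence[float],
    sizes: Sequence[float],
    effort_share: float = 0.2,  # assumed budget: 20% of total code size
) -> List[bool]:
    """Flag modules in descending score order until the flagged code
    size would exceed `effort_share` of the total size.

    Every model evaluated this way is compared at the same inspection
    effort, rather than at a model-specific probability cutoff.
    """
    if len(scores) != len(sizes):
        raise ValueError("scores and sizes must have the same length")
    budget = effort_share * sum(sizes)
    # Indices ordered from highest to lowest predicted risk.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    flagged = [False] * len(scores)
    used = 0.0
    for i in order:
        if used + sizes[i] > budget:
            break  # the next module would exceed the effort budget
        flagged[i] = True
        used += sizes[i]
    return flagged


def size_based_scores(sizes: Sequence[float]) -> List[float]:
    """Hypothetical unsupervised scorer: larger modules get higher risk.
    A stand-in for a simple baseline, not the paper's ONE."""
    return [float(s) for s in sizes]


def recall(flagged: Sequence[bool], defective: Sequence[bool]) -> float:
    """Share of truly defective modules that were flagged."""
    total = sum(defective)
    hits = sum(1 for f, d in zip(flagged, defective) if f and d)
    return hits / total if total else 0.0


if __name__ == "__main__":
    sizes = [100, 50, 30, 20]                # lines of code per module
    defective = [True, True, False, False]   # ground-truth labels
    model_scores = [0.1, 0.9, 0.8, 0.7]      # scores from some model

    for name, s in [("model", model_scores),
                    ("size baseline", size_based_scores(sizes))]:
        flags = effort_aligned_labels(s, sizes, effort_share=0.5)
        print(name, flags, recall(flags, defective))
```

Because both the model and the baseline are cut off at the same share of code to inspect, their recall values can be compared directly, which is the kind of like-for-like comparison the abstract argues for.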
