Toward a consistent performance evaluation for defect prediction models

In the defect prediction community, many defect prediction models have been proposed, and new models continue to be developed. However, there is no consensus on how to evaluate the performance of a newly proposed model. In this paper, we propose MATTER, a fraMework towArd a consisTenT pErformance compaRison, which makes model performance directly comparable across different studies. We take three actions to build a consistent evaluation framework for defect prediction models. First, we propose a simple and easy-to-use unsupervised baseline model, ONE (glObal baseliNe modEl), to provide "a single point of comparison". Second, we propose using an SQA-effort-aligned threshold setting to make comparisons fair. Third, we suggest reporting evaluation results in a unified way and provide a set of core performance indicators for this purpose, thus enabling across-study comparisons that reveal real progress. The experimental results show that MATTER can serve as an effective framework for consistent performance evaluation of defect prediction models; it can therefore help determine whether a newly proposed model is practically useful for practitioners and indicate the real progress made in defect prediction. Furthermore, when applying MATTER to evaluate representative defect prediction models proposed in recent years, we find that most of them (if not all) are not superior to the simple baseline model ONE in terms of SQA-effort-aware prediction performance. This reveals that the real progress in defect prediction has been overestimated. We hence recommend that, in future studies, whenever a new defect prediction model is proposed, MATTER be used to evaluate its actual usefulness (on the same benchmark test data sets) to advance scientific progress in defect prediction.
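
To illustrate the idea of an effort-aligned threshold, the following minimal sketch compares two models at the same inspection budget rather than at each model's own classification cutoff. It is not the procedure defined by MATTER: the use of lines of code as the effort proxy, the 20% budget, and the names `flag_within_effort` and `recall` are all assumptions made for illustration.

```python
from typing import List, Sequence


def flag_within_effort(
    scores: Sequence[float],
    loc: Sequence[int],
    effort_share: float = 0.2,  # assumed budget: 20% of total LOC
) -> List[bool]:
    """Flag the highest-scored modules until the LOC budget would be exceeded.

    Hypothetical helper: every model is cut off at the same inspection
    effort, so their flagged sets are comparable.
    """
    budget = effort_share * sum(loc)
    # Rank module indices from most to least risky according to the model.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    flagged = [False] * len(scores)
    spent = 0
    for i in order:
        if spent + loc[i] > budget:
            break  # stop at the first module that does not fit the budget
        flagged[i] = True
        spent += loc[i]
    return flagged


def recall(flagged: Sequence[bool], defective: Sequence[bool]) -> float:
    """Share of truly defective modules that were flagged."""
    total = sum(defective)
    hits = sum(1 for f, d in zip(flagged, defective) if f and d)
    return hits / total if total else 0.0


# Toy data (invented for this sketch): five modules, 1000 LOC in total.
loc = [100, 150, 50, 450, 250]
defective = [True, False, True, False, True]
model_a = [0.9, 0.1, 0.8, 0.3, 0.2]
model_b = [0.2, 0.9, 0.1, 0.8, 0.3]

for name, scores in (("model_a", model_a), ("model_b", model_b)):
    flagged = flag_within_effort(scores, loc)
    print(name, flagged, round(recall(flagged, defective), 3))
```

Because both models are cut off at the same share of code to inspect, the difference in recall reflects how well each one ranks modules, not how aggressively each one happens to set its threshold.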
