Evaluating classifiers is crucial in statistics and machine learning, as it influences decision-making across fields ranging from patient prognosis to therapy selection in critical conditions. The Matthews correlation coefficient (MCC) is widely regarded as a highly reliable performance metric, providing a balanced measure even in the presence of class imbalance. Despite its importance, comprehensive research on the statistical inference of the MCC remains notably scarce. As a result, studies often merely report and compare MCC point estimates, a common practice that overlooks the statistical significance and reliability of the results. To address this gap, our paper introduces and evaluates several methods for constructing asymptotic confidence intervals for a single MCC and for the difference between MCCs in paired designs. Through simulations across various scenarios, we assess the finite-sample behavior of these methods and compare their performance. Furthermore, through real data analysis, we illustrate the potential utility of our findings in comparing binary classifiers, highlighting the possible contributions of our research to this field.
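For reference, the MCC is computed directly from the four cells of a binary confusion matrix. A minimal sketch of the standard formula follows; the function name `mcc` and the example counts are illustrative and not taken from the paper:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
    Ranges over [-1, 1]: 1 for perfect prediction, 0 for chance-level
    performance, -1 for total disagreement between prediction and truth.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:  # an empty row or column of the confusion matrix
        return 0.0  # conventional value when MCC is undefined
    return (tp * tn - fp * fn) / denom

# A perfect classifier attains MCC = 1 regardless of class balance,
# which is the robustness property the abstract alludes to.
print(mcc(5, 5, 0, 0))  # perfect prediction
print(mcc(6, 3, 1, 2))  # an imbalanced but informative classifier
```

Because the formula uses all four confusion-matrix cells symmetrically, a classifier cannot achieve a high MCC by simply predicting the majority class, unlike accuracy.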