Evaluating classification performance is crucial in statistics and machine learning, as it influences decision-making across various fields, such as patient prognosis and therapy in critical conditions. The Matthews correlation coefficient (MCC) is recognized as a highly reliable performance metric, offering a balanced measurement even in the presence of class imbalance. Despite its importance, there remains a notable lack of comprehensive research on the statistical inference of the MCC. This gap often leads studies to merely report and compare MCC point estimates, a practice that, while common, overlooks the statistical significance and reliability of the results. Addressing this gap, our paper introduces and evaluates several methods to construct asymptotic confidence intervals for a single MCC and for the difference between MCCs in paired designs. Through simulations across various scenarios, we evaluate the finite-sample behavior of these methods and compare their performance. Furthermore, through real data analysis, we illustrate the potential utility of our findings in comparing binary classifiers, highlighting the possible contributions of our research in this field.
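For concreteness, the sketch below computes the MCC from confusion-matrix counts using its standard formula and pairs it with a percentile bootstrap confidence interval. The bootstrap interval is only an illustrative alternative, not the asymptotic intervals studied in the paper; the function names `mcc` and `bootstrap_ci` and all parameter choices are ours, introduced for illustration.

```python
import math
import random

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)),
    with the convention MCC = 0 when the denominator vanishes.
    """
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

def bootstrap_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the MCC of one classifier.

    This resamples (label, prediction) pairs with replacement and
    takes empirical quantiles of the resampled MCC values; it is a
    simple stand-in for the asymptotic intervals the paper develops.
    """
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        tp = sum(1 for i in idx if y_true[i] == 1 and y_pred[i] == 1)
        fp = sum(1 for i in idx if y_true[i] == 0 and y_pred[i] == 1)
        fn = sum(1 for i in idx if y_true[i] == 1 and y_pred[i] == 0)
        tn = sum(1 for i in idx if y_true[i] == 0 and y_pred[i] == 0)
        stats.append(mcc(tp, fp, fn, tn))
    stats.sort()
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

For example, a classifier with counts TP=25, FP=5, FN=5, TN=25 has an MCC of about 0.667, and the bootstrap interval around it quantifies the sampling uncertainty that point-estimate comparisons ignore.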