Analysis and Comparison of Classification Metrics

A variety of performance metrics are commonly used in the machine learning literature to evaluate classification systems. Among the most common for measuring the quality of hard decisions are standard and balanced accuracy, standard and balanced error rate, the F-beta score, and the Matthews correlation coefficient (MCC). In this document, we review the definitions of these and other metrics and compare them with the expected cost (EC), a metric introduced in every statistical learning course but rarely used in the machine learning literature. We show that both the standard and balanced error rates are special cases of the EC. Further, we show how the EC relates to the F-score and MCC and argue that it is superior to these traditional metrics: it is more elegant, general, and intuitive, and it is grounded in basic principles from statistics. The metrics above measure the quality of hard decisions. Yet most modern classification systems output continuous scores for the classes, which we may want to evaluate directly. Metrics for measuring the quality of system scores include the area under the ROC curve, the equal error rate, cross-entropy, the Brier score, and the Bayes EC or Bayes risk, among others. The last three are special cases of a family of metrics given by the expected value of proper scoring rules (PSRs). We review the theory behind these metrics and argue that they are the most principled way to measure the quality of the posterior probabilities produced by a system. Finally, we show how to use these metrics to compute a system's calibration loss and compare it with the standard expected calibration error (ECE), arguing that calibration loss based on PSRs is superior to the ECE for a variety of reasons.
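As an illustration of the claim that the standard and balanced error rates are special cases of the EC, the following minimal sketch computes an empirical EC as the sum over true class i and decision j of c_ij P_i R_ij, where P_i is the empirical prior of class i and R_ij is the fraction of class-i samples decided as class j. The function name `expected_cost`, the toy labels, and the variable names are assumptions made for this example and are not taken from the document. With 0-1 costs the EC should equal the standard error rate, and with costs c_ij = 1/(K P_i) for i ≠ j (K classes) it should equal the balanced error rate.

```python
import numpy as np


def expected_cost(y_true, y_pred, costs):
    """Empirical expected cost: sum over (i, j) of costs[i, j] * P_i * R_ij."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n_classes = costs.shape[0]
    # counts[i, j] = number of samples of true class i decided as class j
    counts = np.zeros((n_classes, n_classes))
    np.add.at(counts, (y_true, y_pred), 1)
    # counts / N is exactly P_i * R_ij, so the EC is a cost-weighted average
    return float(np.sum(costs * counts) / len(y_true))


# Toy, imbalanced two-class data (illustrative only)
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1])

n_classes = 2
# Empirical priors; this sketch assumes every class appears at least once
priors = np.bincount(y_true, minlength=n_classes) / len(y_true)

# 0-1 costs: every error costs 1, correct decisions cost 0
costs_01 = 1.0 - np.eye(n_classes)
# Prior-normalized costs: errors on class i cost 1 / (K * P_i)
costs_bal = costs_01 / (n_classes * priors[:, None])

# Reference values computed directly from their usual definitions
error_rate = float(np.mean(y_true != y_pred))
per_class_error = np.array(
    [np.mean(y_pred[y_true == k] != k) for k in range(n_classes)]
)
balanced_error_rate = float(per_class_error.mean())

ec_01 = expected_cost(y_true, y_pred, costs_01)
ec_bal = expected_cost(y_true, y_pred, costs_bal)

print(ec_01, error_rate)             # the two values should match
print(ec_bal, balanced_error_rate)   # the two values should match
```

Other cost matrices, for example ones that penalize one type of error more heavily than another, can be passed to the same function unchanged, which is the sense in which the EC generalizes both error rates.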
