Selective Classification, wherein models can reject low-confidence predictions, promises reliable translation of machine-learning-based classification systems to real-world scenarios such as clinical diagnostics. While current evaluation of these systems typically assumes fixed working points based on pre-defined rejection thresholds, methodological progress requires benchmarking the general performance of systems, akin to the $\mathrm{AUROC}$ in standard classification. In this work, we define 5 requirements for multi-threshold metrics in selective classification regarding task alignment, interpretability, and flexibility, and show how current approaches fail to meet them. We propose the Area under the Generalized Risk Coverage curve ($\mathrm{AUGRC}$), which meets all requirements and can be directly interpreted as the average risk of undetected failures. We empirically demonstrate the relevance of $\mathrm{AUGRC}$ on a comprehensive benchmark spanning 6 data sets and 13 confidence scoring functions. We find that the proposed metric substantially changes the rankings of confidence scoring functions on 5 out of the 6 data sets.
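To make the stated interpretation concrete, the following is a minimal sketch of the quantity behind the curve under the standard selective-classification setup; the notation ($g$ for the confidence scoring function, $X$ for an input, $Y_{\mathrm{fail}}$ for the binary indicator that the classifier's prediction is a failure, $\mathrm{GR}$ for the generalized risk) is chosen for this summary and may differ from the paper's. The generalized risk at a rejection threshold $\tau$ is the joint probability of a sample being accepted and its prediction being a failure,
\[
\mathrm{GR}(\tau) \;=\; \mathbb{E}\bigl[\,\mathbb{1}\{g(X) \ge \tau\}\cdot \mathbb{1}\{Y_{\mathrm{fail}} = 1\}\,\bigr],
\]
and $\mathrm{AUGRC}$ averages this quantity over all working points by integrating along the coverage axis $c(\tau) = \mathbb{P}\bigl(g(X) \ge \tau\bigr)$:
\[
\mathrm{AUGRC} \;=\; \int_0^1 \mathrm{GR}\bigl(\tau(c)\bigr)\,\mathrm{d}c .
\]
Unlike the selective risk $\mathbb{E}\bigl[\mathbb{1}\{Y_{\mathrm{fail}} = 1\} \mid g(X) \ge \tau\bigr]$, which conditions on acceptance, the joint formulation directly yields the reading of $\mathrm{AUGRC}$ as the average risk of undetected failures.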