Selecting an evaluation metric is fundamental to model development, but uncertainty remains about when certain metrics are preferable and why. This paper introduces the concept of resolving power to describe the ability of an evaluation metric to distinguish between binary classifiers of similar quality. This ability depends on two attributes: (1) the metric's response to improvements in classifier quality (its signal), and (2) the metric's sampling variability (its noise). The paper defines resolving power generically as a metric's sampling uncertainty scaled by its signal. The primary application of resolving power is to assess threshold-free evaluation metrics, such as the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC). A simulation study compares the AUROC and the AUPRC in a variety of contexts. It finds that the AUROC generally has greater resolving power, but that the AUPRC is better when searching among high-quality classifiers applied to low-prevalence outcomes. The paper concludes by proposing an empirical method to estimate resolving power that can be applied to any dataset and any initial classification model.
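The noise-over-signal idea can be illustrated with a small simulation. The sketch below is not the paper's procedure; it is a minimal illustration under assumptions introduced here: scores follow a binormal model (negatives drawn from N(0, 1), positives from N(d, 1)), the separation `d` stands in for classifier quality, class counts are fixed by the prevalence, the AUPRC is approximated by average precision, and tied scores are ignored. The function names (`auroc`, `average_precision`, `simulate`, `noise_to_signal`) and all default parameters are hypothetical.

```python
import random
import statistics


def auroc(labels, scores):
    """Rank-based (Mann-Whitney) AUROC estimate; assumes no tied scores."""
    order = sorted(range(len(scores)), key=scores.__getitem__)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(rank for rank, i in enumerate(order, start=1) if labels[i] == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def average_precision(labels, scores):
    """Average precision, used here as a stand-in for the AUPRC."""
    order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    hits, total = 0, 0.0
    for k, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / k  # precision at each positive
    return total / hits


def simulate(metric, d, n, prevalence, reps, rng):
    """Metric values over repeated samples from the assumed binormal model."""
    n_pos = max(1, round(n * prevalence))  # fixed class counts (an assumption)
    n_neg = n - n_pos
    labels = [1] * n_pos + [0] * n_neg
    values = []
    for _ in range(reps):
        scores = [rng.gauss(d, 1.0) for _ in range(n_pos)]
        scores += [rng.gauss(0.0, 1.0) for _ in range(n_neg)]
        values.append(metric(labels, scores))
    return values


def noise_to_signal(metric, d, delta=0.25, n=500, prevalence=0.1, reps=200, seed=0):
    """Sampling SD of the metric divided by its response to a quality gain.

    Signal: finite-difference change in the mean metric per unit of d.
    Noise: standard deviation of the metric across samples at quality d.
    """
    rng = random.Random(seed)
    base = simulate(metric, d, n, prevalence, reps, rng)
    better = simulate(metric, d + delta, n, prevalence, reps, rng)
    signal = (statistics.fmean(better) - statistics.fmean(base)) / delta
    noise = statistics.stdev(base)
    return noise / signal


for name, metric in [("AUROC", auroc), ("average precision", average_precision)]:
    print(name, round(noise_to_signal(metric, d=1.0), 3))
```

Under this reading, a smaller ratio means the metric can separate classifiers whose quality differs by a smaller amount. The ratio is expressed in units of the assumed quality scale `d`, so values are comparable across metrics only when the same scale is used; varying `d` and `prevalence` gives a rough way to explore how the comparison between the two metrics shifts.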