There exists a wide range of single-number metrics for assessing the performance of classification algorithms, including AUC and the F1-score (Wikipedia lists 17 such metrics, with 27 different names). In this article, I propose a new metric to answer the following question: when an algorithm is tuned so that it can no longer distinguish labelled cats from real cats, how often does a randomly chosen image that has been labelled as containing a cat actually contain a cat? The metric is constructed in two steps. First, we set a threshold score such that when the algorithm is shown two randomly chosen images -- one with a score greater than the threshold (i.e. a picture labelled as containing a cat) and another drawn from the pictures that really do contain a cat -- the probability that the image with the higher score is the one drawn from the set of real cat images is 50\%. At this decision threshold, the set of positively labelled images is indistinguishable from the set of images that are truly positive. Second, we measure performance by asking how often a randomly chosen picture from those labelled as containing a cat actually contains a cat. This metric can be thought of as {\it precision at the indistinguishability threshold}. While this new metric does not address the tradeoff between precision and recall inherent to all such metrics, I do show why this method avoids pitfalls that can occur when using, for example, AUC, and why it is better motivated than, for example, the F1-score.
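The two-step construction can be illustrated with a minimal sketch. This is not the article's reference implementation: the function names `prob_real_beats_labelled` and `precision_at_indistinguishability` are hypothetical, ties are assumed to count as half a win, and on a finite sample the threshold is assumed to be the candidate whose pairwise probability is closest to 50\% rather than exactly equal to it.

```python
import numpy as np


def prob_real_beats_labelled(real_scores, labelled_scores):
    """Estimate P(score of a random real positive > score of a random
    labelled positive). Ties count as 0.5 (an assumption of this sketch)."""
    real_sorted = np.sort(real_scores)
    n_real = real_sorted.size
    left = np.searchsorted(real_sorted, labelled_scores, side="left")
    right = np.searchsorted(real_sorted, labelled_scores, side="right")
    greater = n_real - right   # real scores strictly above each labelled score
    equal = right - left       # real scores tied with each labelled score
    return float(np.mean((greater + 0.5 * equal) / n_real))


def precision_at_indistinguishability(scores, labels):
    """Return (threshold, precision) at the threshold where labelled
    positives are closest to indistinguishable from real positives."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    real_scores = scores[labels]
    if real_scores.size == 0:
        raise ValueError("at least one real positive is required")

    # Candidate thresholds: label everything positive (-inf), then every
    # observed score except the largest, so the labelled set is never empty.
    candidates = np.concatenate(([-np.inf], np.unique(scores)[:-1]))

    best = None
    for t in candidates:
        labelled_mask = scores > t
        # Step 1: how far is the pairwise probability from 50%?
        p = prob_real_beats_labelled(real_scores, scores[labelled_mask])
        gap = abs(p - 0.5)
        if best is None or gap < best[0]:
            # Step 2: precision among images labelled positive at t.
            precision = float(labels[labelled_mask].mean())
            best = (gap, float(t), precision)

    _, threshold, precision = best
    return threshold, precision


if __name__ == "__main__":
    # Toy data: a classifier that separates the two classes perfectly.
    toy_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
    toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
    print(precision_at_indistinguishability(toy_scores, toy_labels))
```

For a classifier that separates the classes perfectly, the labelled set coincides with the real positives at the chosen threshold, so the pairwise probability is exactly 50\% and the precision is 1. The linear scan over candidate thresholds is chosen for readability; a larger dataset would likely call for a cumulative or bisection-based search.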