In computer vision and machine learning, as in many other research domains, rigorous evaluation of any new method, including classifiers, is essential. A key component of the evaluation process is the ability to compare and rank methods. However, ranking classifiers and accurately comparing their performance, especially when application-specific preferences must be taken into account, remains challenging. For instance, commonly used evaluation tools such as the Receiver Operating Characteristic (ROC) and Precision/Recall (PR) spaces display performance in terms of only two scores. Hence, they are inherently limited in their ability to compare classifiers across a broader range of scores, and they cannot establish a clear ranking among classifiers. In this paper, we present a novel, versatile tool, named the Tile, that organizes infinitely many ranking scores in a single 2D map for two-class classifiers, including common evaluation scores such as the accuracy, the true positive rate, the positive predictive value, Jaccard's coefficient, and all F-beta scores. Furthermore, we study the properties of the underlying ranking scores, such as the influence of the priors and the correspondences with the ROC space, and show how to characterize any other score by comparing it with the Tile. Overall, we demonstrate that the Tile is a powerful tool that captures all these rankings in a single visualization and allows them to be interpreted.
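To make the ranking problem concrete, the following minimal Python sketch computes the common scores named above from the confusion-matrix counts of a two-class classifier and ranks two classifiers under each score. It only illustrates that different scores can order the same classifiers differently; it is not the construction of the Tile. The `Confusion` container, the `rank` helper, and the counts for classifiers "A" and "B" are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Confusion-matrix counts of a two-class classifier (hypothetical container)."""
    tp: int  # true positives
    fn: int  # false negatives
    fp: int  # false positives
    tn: int  # true negatives


# Standard definitions; nonzero denominators are assumed for simplicity.
def accuracy(c):
    return (c.tp + c.tn) / (c.tp + c.fn + c.fp + c.tn)


def true_positive_rate(c):
    return c.tp / (c.tp + c.fn)


def positive_predictive_value(c):
    return c.tp / (c.tp + c.fp)


def jaccard(c):
    return c.tp / (c.tp + c.fn + c.fp)


def f_beta(c, beta=1.0):
    b2 = beta * beta
    return (1 + b2) * c.tp / ((1 + b2) * c.tp + b2 * c.fn + c.fp)


def rank(classifiers, score):
    """Return classifier names ordered from best to worst under `score`."""
    return sorted(classifiers, key=lambda name: score(classifiers[name]), reverse=True)


# Made-up counts for two classifiers evaluated on 100 positives and 100 negatives.
CLASSIFIERS = {
    "A": Confusion(tp=80, fn=20, fp=30, tn=70),
    "B": Confusion(tp=60, fn=40, fp=5, tn=95),
}

SCORES = {
    "accuracy": accuracy,
    "true positive rate": true_positive_rate,
    "positive predictive value": positive_predictive_value,
    "Jaccard": jaccard,
    "F1": f_beta,
}

if __name__ == "__main__":
    for name, score in SCORES.items():
        print(f"{name}: {' > '.join(rank(CLASSIFIERS, score))}")
```

With these illustrative counts, the accuracy and the positive predictive value place B first, whereas the true positive rate, Jaccard's coefficient, and F1 place A first, so the ranking depends on the score that is chosen.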