In computer vision, machine learning, and many other research domains, rigorous evaluation of any new method, including classifiers, is essential. A key component of evaluation is the ability to compare and rank methods. However, ranking classifiers and accurately comparing their performance, especially when application-specific preferences must be taken into account, remains challenging. For instance, commonly used evaluation tools such as the Receiver Operating Characteristic (ROC) and Precision/Recall (PR) spaces display performance using only two scores. Hence, they are inherently limited in their ability to compare classifiers across a broader range of scores, and they cannot establish a clear ranking among classifiers. In this paper, we present a novel, versatile tool, named the Tile, that organizes infinitely many ranking scores in a single 2D map for two-class classifiers, including common evaluation scores such as the accuracy, the true positive rate, the positive predictive value, Jaccard's coefficient, and all F-beta scores. Furthermore, we study the properties of the underlying ranking scores, such as the influence of the priors and the correspondences with the ROC space, and show how to characterize any other score by comparing it to the Tile. Overall, we demonstrate that the Tile is a powerful tool that effectively captures all the rankings in a single visualization and allows them to be interpreted.
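To make the ranking problem concrete, the following minimal sketch computes the scores named in the abstract from a two-class confusion matrix and ranks two classifiers under each of them. It uses the standard textbook definitions of these scores and does not reproduce the Tile construction itself. The `Confusion` class, the `rank` helper, and the counts for classifiers `A` and `B` are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Two-class confusion matrix counts (hypothetical helper type)."""
    tn: int
    fp: int
    fn: int
    tp: int


def accuracy(c):
    return (c.tp + c.tn) / (c.tn + c.fp + c.fn + c.tp)


def tpr(c):
    # True positive rate (recall).
    return c.tp / (c.tp + c.fn)


def ppv(c):
    # Positive predictive value (precision).
    return c.tp / (c.tp + c.fp)


def jaccard(c):
    # Jaccard's coefficient for the positive class.
    return c.tp / (c.tp + c.fp + c.fn)


def f_beta(c, beta=1.0):
    # F-beta score; beta = 1 gives the usual F1 score.
    b2 = beta * beta
    return (1 + b2) * c.tp / ((1 + b2) * c.tp + b2 * c.fn + c.fp)


SCORES = {"accuracy": accuracy, "TPR": tpr, "PPV": ppv,
          "Jaccard": jaccard, "F1": f_beta}


def rank(classifiers, score):
    """Return classifier names ordered from best to worst under `score`."""
    return sorted(classifiers, key=lambda name: score(classifiers[name]),
                  reverse=True)


# Illustrative counts only, not taken from the paper.
classifiers = {
    "A": Confusion(tn=870, fp=30, fn=20, tp=80),
    "B": Confusion(tn=895, fp=5, fn=40, tp=60),
}

for name, score in SCORES.items():
    print(f"{name:8s} ranking: {rank(classifiers, score)}")
```

With these counts, accuracy and PPV place `B` first while TPR, Jaccard's coefficient, and F1 place `A` first, which illustrates why no single pair of scores settles the ranking. The sketch does not guard against empty denominators, for example a classifier that never predicts the positive class.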