Foundations of the Theory of Performance-Based Ranking

Ranking entities such as algorithms, devices, methods, or models based on their performances, while accounting for application-specific preferences, is challenging. To address this challenge, we establish the foundations of a universal theory for performance-based ranking. First, we introduce a rigorous framework built on both probability theory and order theory. This framework encompasses the elements necessary to (1) manipulate performances as mathematical objects, (2) express which performances are worse than or equivalent to others, (3) model tasks through a variable called satisfaction, (4) consider properties of the evaluation, (5) define scores, and (6) specify application-specific preferences through a variable called importance. On top of this framework, we propose the first axiomatic definition of performance orderings and performance-based rankings. Then, we introduce a universal parametric family of scores, called ranking scores, that can be used to establish rankings that satisfy our axioms while taking application-specific preferences into account. Finally, we show, in the case of two-class classification, that the family of ranking scores encompasses well-known performance scores, including the accuracy, the true positive rate (recall, sensitivity), the true negative rate (specificity), the positive predictive value (precision), and the F1 score. However, we also show that some other scores commonly used to compare classifiers are unsuitable for deriving performance orderings that satisfy the axioms. This paper therefore provides the computer vision and machine learning communities with a rigorous framework for evaluating and ranking entities.
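To make the two-class claim concrete, the following is a minimal sketch rather than the paper's definition. It assumes that a score can be written as an importance-weighted mean of satisfaction over the four confusion-matrix outcomes; the abstract does not state this form, so it is an assumption here. The names `SATISFACTION`, `IMPORTANCE`, and `weighted_score` are hypothetical. Under that assumption, each of the well-known scores listed above corresponds to one choice of importance values.

```python
from fractions import Fraction

# ASSUMPTION: satisfaction is 1 for correct outcomes and 0 for errors.
SATISFACTION = {"tn": 1, "fp": 0, "fn": 0, "tp": 1}

# ASSUMPTION: importance values chosen so that the weighted mean below
# reproduces the textbook formula of each score.
IMPORTANCE = {
    "accuracy": {"tn": 1, "fp": 1, "fn": 1, "tp": 1},
    "tpr": {"tn": 0, "fp": 0, "fn": 1, "tp": 1},
    "tnr": {"tn": 1, "fp": 1, "fn": 0, "tp": 0},
    "ppv": {"tn": 0, "fp": 1, "fn": 0, "tp": 1},
    "f1": {"tn": 0, "fp": Fraction(1, 2), "fn": Fraction(1, 2), "tp": 1},
}


def weighted_score(counts, importance, satisfaction=SATISFACTION):
    """Importance-weighted mean of satisfaction over outcome counts.

    `counts` maps each outcome ("tn", "fp", "fn", "tp") to a
    non-negative integer count.
    """
    numerator = sum(importance[k] * satisfaction[k] * counts[k] for k in counts)
    denominator = sum(importance[k] * counts[k] for k in counts)
    if denominator == 0:
        raise ValueError("score undefined: no outcome carries importance")
    return Fraction(numerator, denominator)


if __name__ == "__main__":
    # Illustrative counts only, not data from the paper.
    counts = {"tn": 50, "fp": 10, "fn": 5, "tp": 35}
    for name, importance in IMPORTANCE.items():
        print(name, weighted_score(counts, importance))
```

Exact fractions are used so that each value can be checked by hand against the usual formula, for example F1 = 2TP / (2TP + FP + FN).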
