Empirical Optimal Risk to Quantify Model Trustworthiness for Failure Detection

Failure detection (FD) is a crucial safeguard when deploying AI systems in safety-critical tasks. FD performance is commonly evaluated with the risk-coverage (RC) curve, which reveals the trade-off between the data coverage rate and the performance on accepted data. A common way to summarize the RC curve is to calculate the area under it. However, this metric does not indicate how well suited a method is for FD, or what the optimal coverage rate should be. Because FD aims to achieve higher performance while discarding less data, evaluating at partial coverage, with the most uncertain samples excluded, is more intuitive and meaningful than evaluating at full coverage. In addition, there is an optimal coverage point at which the model could, in theory, achieve ideal performance. We propose the Excess Area Under the Optimal RC Curve (E-AUoptRC), which measures the area over the coverage range from the optimal point to full coverage. Furthermore, the model performance at this optimal point reflects both the model's learning ability and its calibration. We propose this quantity as the Trust Index (TI), an evaluation metric complementary to overall model accuracy. We report extensive experiments on three benchmark image datasets with ten variants of transformer and CNN models. Our results show that the proposed metrics reflect model trustworthiness better than existing evaluation metrics. We further observe that a model with high overall accuracy does not always yield a high TI, which indicates the need for the Trust Index as a complement to overall accuracy. The code is available at \url{https://github.com/AoShuang92/optimal_risk}.
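
As an illustration of the quantities named in the abstract, the following minimal sketch computes an RC curve and the area under it from per-sample confidence scores and correctness labels. The abstract does not formally define the optimal point, E-AUoptRC, or TI, so the last function rests on a loudly stated assumption: it takes the optimal point to be the coverage equal to the overall accuracy, the point up to which a perfect confidence ranking would accept only correct predictions. The function names and that assumption are hypothetical and are not taken from the authors' repository.

```python
import numpy as np


def risk_coverage_curve(confidence, correct):
    """Return (coverage, risk) arrays, accepting samples from most to least confident."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    # Sort samples by descending confidence; a stable sort keeps ties in input order.
    order = np.argsort(-confidence, kind="stable")
    errors = (~correct[order]).astype(float)
    accepted = np.arange(1, len(errors) + 1)
    coverage = accepted / len(errors)        # fraction of data accepted
    risk = np.cumsum(errors) / accepted      # error rate on the accepted data
    return coverage, risk


def area_under_rc(risk):
    """Approximate the area under the RC curve as the mean risk over all coverage levels."""
    return float(np.mean(risk))


def risk_at_assumed_optimal_point(confidence, correct):
    """ASSUMPTION (not from the abstract): the optimal coverage equals overall accuracy.

    Returns (coverage, risk) at that point on the empirical RC curve.
    """
    correct = np.asarray(correct, dtype=bool)
    n_correct = int(correct.sum())
    if n_correct == 0:
        raise ValueError("no correct predictions; the assumed optimal point is undefined")
    coverage, risk = risk_coverage_curve(confidence, correct)
    return float(coverage[n_correct - 1]), float(risk[n_correct - 1])


if __name__ == "__main__":
    # Toy data: five predictions, three of them correct.
    conf = [0.9, 0.8, 0.7, 0.6, 0.5]
    corr = [1, 1, 0, 1, 0]
    cov, rsk = risk_coverage_curve(conf, corr)
    print("coverage:", cov)
    print("risk:", rsk)
    print("area under RC curve:", area_under_rc(rsk))
    print("assumed optimal point (coverage, risk):", risk_at_assumed_optimal_point(conf, corr))
```

Under this assumption, a model whose confidence ranks every correct prediction above every incorrect one would show zero risk at the assumed optimal point, while the toy data above does not, because one error is ranked above a correct prediction.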
