Despite the increasing demand for safer machine learning practices, the use of Uncertainty Quantification (UQ) methods in production remains limited. This limitation is exacerbated by the challenge of validating UQ methods in the absence of UQ ground truth. In classification tasks, when only a standard set of test data is available, several authors have suggested metrics that can be computed from such test points to assess the quality of quantified uncertainties. This paper investigates such metrics and proves that they are theoretically well-behaved and actually tied to an uncertainty ground truth which is easily interpretable as a ranking of model predictions by trustworthiness. Equipped with these new results, and given the applicability of these metrics in the usual supervised paradigm, we argue that our contributions will help promote a broader use of UQ in deep learning.