ValUES: A Framework for Systematic Validation of Uncertainty Estimation in Semantic Segmentation

Uncertainty estimation is an essential and heavily studied component for the reliable application of semantic segmentation methods. While various studies claim methodological advances on the one hand and successful application on the other, the field is currently hampered by a gap between theory and practice that leaves fundamental questions unanswered: Can data-related and model-related uncertainty really be separated in practice? Which components of an uncertainty method are essential for real-world performance? Which uncertainty method works well for which application? In this work, we link this research gap to a lack of systematic and comprehensive evaluation of uncertainty methods. Specifically, we identify three key pitfalls in the current literature and present an evaluation framework that bridges the research gap by providing 1) a controlled environment for studying data ambiguities as well as distribution shifts, 2) systematic ablations of relevant method components, and 3) test-beds for the five predominant uncertainty applications: OoD-detection, active learning, failure detection, calibration, and ambiguity modeling. Empirical results on simulated as well as real-world data demonstrate how the proposed framework can answer the predominant questions in the field, revealing for instance that 1) separation of uncertainty types works on simulated data but does not necessarily translate to real-world data, 2) aggregation of scores is a crucial but currently neglected component of uncertainty methods, and 3) while ensembles perform most robustly across the different downstream tasks and settings, test-time augmentation often constitutes a lightweight alternative. Code is available at: https://github.com/IML-DKFZ/values
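
To make two of the terms above concrete, the following is a minimal sketch of how data-related and model-related uncertainty are commonly separated for a single pixel, and how pixel-level scores can be aggregated to an image-level score. It assumes the widely used information-theoretic decomposition (entropy of the mean prediction, expected entropy of the members, and their difference) applied to ensemble outputs; the abstract does not state which decomposition or aggregation strategies the framework uses, and the function names `decompose_pixel` and `aggregate_image` are hypothetical and not taken from the linked repository.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def decompose_pixel(member_probs):
    """Split the predictive uncertainty of one pixel (illustrative only).

    member_probs: list of per-member class-probability lists (M x C),
    e.g. the softmax outputs of M ensemble members for this pixel.
    Returns (total, data_related, model_related), where
      total         = entropy of the mean prediction,
      data_related  = mean entropy of the individual members,
      model_related = total - data_related (mutual information).
    """
    n_members = len(member_probs)
    n_classes = len(member_probs[0])
    mean_probs = [
        sum(member[c] for member in member_probs) / n_members
        for c in range(n_classes)
    ]
    total = entropy(mean_probs)
    data_related = sum(entropy(member) for member in member_probs) / n_members
    return total, data_related, total - data_related


def aggregate_image(pixel_scores, strategy="mean"):
    """Reduce a flat list of pixel-level scores to one image-level score."""
    if strategy == "mean":
        return sum(pixel_scores) / len(pixel_scores)
    if strategy == "max":
        return max(pixel_scores)
    raise ValueError(f"unknown strategy: {strategy}")


# Members agree that the pixel is ambiguous: all uncertainty is data-related.
print(decompose_pixel([[0.5, 0.5], [0.5, 0.5]]))
# Members are individually confident but disagree: all uncertainty is model-related.
print(decompose_pixel([[1.0, 0.0], [0.0, 1.0]]))
# The same uncertainty map gives very different image-level scores.
scores = [0.0, 0.0, 0.0, 1.0]
print(aggregate_image(scores, "mean"), aggregate_image(scores, "max"))
```

The last line illustrates why the choice of aggregation matters: a small uncertain region is diluted by mean aggregation but dominates under max aggregation, so the same pixel-level method can rank images differently in image-level tasks such as failure detection or OoD-detection.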
