Evaluating AI systems under uncertain ground truth: a case study in dermatology

David Stutz, Ali Taylan Cemgil, Abhijit Guha Roy, Tatiana Matejovicova, Melih Barsbey, Patricia Strachan, Mike Schaekermann, Jan Freyberg, Rajeev Rikhye, Beverly Freeman, Javier Perez Matos, Umesh Telang, Dale R. Webster, Yuan Liu, Greg S. Corrado, Yossi Matias, Pushmeet Kohli, Yun Liu, Arnaud Doucet, Alan Karthikesalingam

For safety, AI systems in health undergo thorough evaluations before deployment, validating their predictions against a ground truth that is assumed certain. In practice, however, the ground truth may itself be uncertain. Standard evaluation of AI models largely ignores this, which can have severe consequences such as overestimating future performance. To avoid this, we measure the effects of ground truth uncertainty, which we assume decomposes into two main components: annotation uncertainty, which stems from the lack of reliable annotations, and inherent uncertainty, which is due to limited observational information. This ground truth uncertainty is ignored when the ground truth is estimated by deterministically aggregating annotations, e.g., by majority voting or averaging. In contrast, we propose a framework in which aggregation is done using a statistical model. Specifically, we frame aggregation of annotations as posterior inference of so-called plausibilities, which represent distributions over classes in a classification setting, subject to a hyper-parameter encoding annotator reliability. Based on this model, we propose a metric for measuring annotation uncertainty and provide uncertainty-adjusted metrics for performance evaluation. We present a case study applying our framework to skin condition classification from images, where annotations are provided in the form of differential diagnoses. The deterministic adjudication process from previous work, called inverse rank normalization (IRN), ignores ground truth uncertainty in evaluation. Instead, we present two alternative statistical models: a probabilistic version of IRN and a Plackett-Luce-based model. We find that a large portion of the dataset exhibits significant ground truth uncertainty, and that standard IRN-based evaluation severely overestimates performance without providing uncertainty estimates.
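To make the contrast between deterministic and statistical aggregation concrete, the following is a minimal sketch, not the authors' implementation. It assumes that IRN gives each condition in an annotator's ranked differential diagnosis a weight of 1/rank, normalizes per annotator, and averages across annotators; it ignores ties within a ranking. The Dirichlet sampler is a hypothetical stand-in for a probabilistic version of IRN, and the function names, the `reliability` and `prior` parameters, and the two metric definitions are illustrative assumptions rather than the definitions used in the paper.

```python
import numpy as np


def irn(rankings, num_classes):
    """Deterministic aggregation (assumed form of inverse rank normalization).

    rankings: list of per-annotator lists of class indices, most likely first.
    Each ranking is assumed to be non-empty and free of ties.
    """
    total = np.zeros(num_classes)
    for ranking in rankings:
        weights = np.zeros(num_classes)
        for rank, cls in enumerate(ranking, start=1):
            weights[cls] = 1.0 / rank          # weight is the inverse of the rank
        total += weights / weights.sum()       # normalize per annotator
    return total / total.sum()                 # normalize across annotators


def sample_plausibilities(rankings, num_classes, reliability, num_samples, rng,
                          prior=0.1):
    """Hypothetical probabilistic aggregation: Dirichlet samples around IRN.

    A higher `reliability` concentrates the samples on the IRN point estimate.
    `prior` is an assumed pseudo-count that keeps all parameters positive.
    """
    alpha = reliability * len(rankings) * irn(rankings, num_classes) + prior
    return rng.dirichlet(alpha, size=num_samples)


def annotation_certainty(samples):
    """Illustrative certainty: how often the most frequent top-1 label wins."""
    top1 = samples.argmax(axis=1)
    counts = np.bincount(top1, minlength=samples.shape[1])
    return counts.max() / len(top1)


def adjusted_top1_accuracy(prediction, samples):
    """Illustrative uncertainty-adjusted accuracy for a single case:
    the fraction of sampled plausibilities whose top-1 label is the prediction.
    """
    return float(np.mean(samples.argmax(axis=1) == prediction))


# Three annotators, three conditions (indices 0, 1, 2).
rankings = [[0, 1], [0, 2], [1, 0]]
rng = np.random.default_rng(0)

point_estimate = irn(rankings, num_classes=3)   # [5/9, 1/3, 1/9]
samples = sample_plausibilities(rankings, 3, reliability=1.0,
                                num_samples=1000, rng=rng)
certainty = annotation_certainty(samples)
accuracy = adjusted_top1_accuracy(0, samples)
```

With a deterministic point estimate, a model predicting condition 0 would simply be counted as correct for this case. Under the sampled plausibilities, the same prediction receives only partial credit when the annotators disagree, and the credit approaches 1 as the assumed reliability grows.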
