Evaluating the Fairness of Deep Learning Uncertainty Estimates in Medical Image Analysis

Although deep learning (DL) models have shown great success in many medical image analysis tasks, deploying them in real clinical contexts requires (1) that they exhibit robustness and fairness across different sub-populations, and (2) that the confidence in their predictions be accurately expressed in the form of uncertainties. Unfortunately, recent studies have shown significant biases in DL models across demographic subgroups (e.g., race, sex, age) in medical image analysis, indicating a lack of fairness in the models. Although several methods have been proposed in the ML literature to mitigate this lack of fairness, they focus entirely on absolute performance across groups and do not consider their effect on uncertainty estimation. In this work, we present the first exploration of how popular fairness models affect both bottom-line performance across subgroups and uncertainty quantification in medical image analysis. We perform extensive experiments on three clinically relevant tasks: (i) skin lesion classification, (ii) brain tumour segmentation, and (iii) Alzheimer's disease clinical score regression. Our results indicate that popular ML methods, such as data-balancing and distributionally robust optimization, succeed in mitigating fairness issues in terms of model performance for some of the tasks. However, this can come at the cost of poor uncertainty estimates associated with the model predictions. This tradeoff must be mitigated if fairness models are to be adopted in medical image analysis.
