The success of deep learning models deployed in the real world depends critically on their ability to generalize across diverse data domains. Here, we address a fundamental challenge with selective classification during automated diagnosis with domain-shifted medical images. In this scenario, models must learn to abstain from predicting when label confidence is low, especially when tested on samples far removed from the training set (covariate shift). Such uncertain cases are typically referred to a clinician for further analysis and evaluation. Yet, we show that even state-of-the-art domain generalization approaches fail severely during referral when tested on medical images acquired from a different demographic or with a different technology. We examine two benchmark diagnostic medical imaging datasets exhibiting strong covariate shifts: i) diabetic retinopathy prediction with retinal fundus images and ii) multilabel disease prediction with chest X-ray images. We show that predictive uncertainty estimates do not generalize well under covariate shifts, leading to non-monotonic referral curves and severe drops in performance (up to 50%) at high referral rates (>70%). We evaluate novel combinations of robust generalization and post hoc referral approaches that rescue these failures and achieve significant performance improvements, typically >10%, over baseline methods. Our study identifies a critical challenge with referral in domain-shifted medical images and finds key applications in reliable, automated disease diagnosis.
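To make the notion of a referral curve concrete, the following is a minimal sketch rather than the evaluation code used in this study. It assumes one common convention: at each referral rate, the most uncertain fraction of test samples is handed to the clinician, and performance is measured on the retained samples. The function names `referral_curve` and `is_monotonic`, the choice of plain accuracy as the metric, and the eight-sample toy data are all illustrative assumptions, not results or methods from the study.

```python
def referral_curve(correct, uncertainty, referral_rates):
    """Accuracy on retained samples at each referral rate.

    correct:        list of 0/1 flags, 1 if the model's prediction was right
    uncertainty:    list of per-sample uncertainty scores (higher = less sure)
    referral_rates: fractions of samples to refer, each in [0, 1]

    Assumption: plain accuracy is the metric; other metrics could be
    substituted on the retained subset.
    """
    n = len(correct)
    # Indices ordered from most to least uncertain.
    order = sorted(range(n), key=lambda i: uncertainty[i], reverse=True)
    curve = []
    for rate in referral_rates:
        n_referred = int(rate * n)      # samples sent to the clinician
        retained = order[n_referred:]   # samples the model still decides on
        if not retained:
            curve.append(float("nan"))  # nothing left to score
        else:
            curve.append(sum(correct[i] for i in retained) / len(retained))
    return curve


def is_monotonic(curve, tol=1e-12):
    """True if performance never drops as the referral rate increases."""
    return all(b >= a - tol for a, b in zip(curve, curve[1:]))


# Synthetic toy data (hypothetical, for illustration only).
correct = [1, 1, 1, 1, 0, 1, 0, 0]
rates = [0.0, 0.25, 0.5, 0.75]

# Case 1: uncertainty tracks errors, so referral helps.
informative = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
curve_informative = referral_curve(correct, informative, rates)

# Case 2: uncertainty is misaligned with errors, as might happen when
# uncertainty estimates fail to generalize; referral removes correct cases.
misaligned = informative[::-1]
curve_misaligned = referral_curve(correct, misaligned, rates)

print(curve_informative, is_monotonic(curve_informative))
print(curve_misaligned, is_monotonic(curve_misaligned))
```

In the first toy case accuracy rises as more samples are referred. In the second, the model refers the cases it would have classified correctly, so accuracy on the retained samples falls with the referral rate, which is the kind of non-monotonic behavior described above.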