As artificial intelligence (AI) rapidly approaches human-level performance in medical imaging, it is crucial that it does not exacerbate or propagate healthcare disparities. Prior research has established that AI can infer demographic data from chest X-rays, which raises a key concern: do models that rely on demographic shortcuts make unfair predictions across subpopulations? In this study, we thoroughly investigate the extent to which medical AI uses demographic encodings, focusing on potential fairness discrepancies within both in-distribution training sets and external test sets. Our analysis covers three key medical imaging disciplines (radiology, dermatology, and ophthalmology) and incorporates data from six global chest X-ray datasets. We confirm that medical imaging AI leverages demographic shortcuts in disease classification. Although algorithmically correcting shortcuts effectively closes fairness gaps and yields "locally optimal" models within the original data distribution, this optimality does not hold in new test settings. Surprisingly, we find that models with less encoding of demographic attributes are often the most "globally optimal", exhibiting better fairness when evaluated in new test environments. Our work establishes best practices for medical imaging models that maintain their performance and fairness in deployments beyond their initial training contexts, underscoring critical considerations for AI clinical deployments across populations and sites.