Risk of Bias in Chest Radiography Deep Learning Foundation Models

Purpose: To analyze a recently published chest radiography foundation model for the presence of biases that could lead to subgroup performance disparities across biological sex and race.

Materials and Methods: This retrospective study used 127,118 chest radiographs from 42,884 patients (mean age, 63 years ± 17 [SD]; 23,623 male, 19,261 female) from the CheXpert dataset, collected between October 2002 and July 2017. To determine the presence of bias in features generated by a chest radiography foundation model and a baseline deep learning model, dimensionality reduction methods together with two-sample Kolmogorov-Smirnov tests were used to detect distribution shifts across sex and race. A comprehensive disease detection performance analysis was then performed to associate any biases in the features with specific disparities in classification performance across patient subgroups.

Results: Ten of twelve pairwise comparisons across biological sex and race showed statistically significant differences in the studied foundation model, compared with four significant tests in the baseline model. Significant differences were found between male and female patients (P < .001) and between Asian and Black patients (P < .001) in the feature projections that primarily capture disease. Compared with average model performance across all subgroups, classification performance on the 'no finding' label dropped between 6.8% and 7.8% for female patients, and performance in detecting 'pleural effusion' dropped between 10.7% and 11.6% for Black patients.

Conclusion: The studied chest radiography foundation model demonstrated racial and sex-related bias leading to disparate performance across patient subgroups and may be unsafe for clinical applications.
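The feature-level test described in Materials and Methods can be illustrated with a minimal sketch. It is not the study's code: the choice of PCA as the dimensionality reduction step, the number of components, the Bonferroni threshold, the function name `subgroup_shift_tests`, and the synthetic data are all assumptions made for illustration. The idea is to project model features onto a few components and compare the per-component distributions between each pair of subgroups with a two-sample Kolmogorov-Smirnov test.

```python
import numpy as np
from scipy.stats import ks_2samp


def subgroup_shift_tests(features, groups, n_components=4, alpha=0.05):
    """KS-test each principal component between every pair of subgroups.

    Hypothetical helper: PCA, n_components and the Bonferroni correction
    are illustrative assumptions, not the settings used in the study.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(groups)

    # PCA via SVD of the centered feature matrix; rows of vt are the
    # principal directions, ordered by explained variance.
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projections = centered @ vt[:n_components].T

    names = sorted(set(labels.tolist()))
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]

    # Bonferroni-adjusted threshold over all tests (an assumed correction).
    threshold = alpha / (len(pairs) * n_components)

    results = []
    for a, b in pairs:
        for k in range(n_components):
            test = ks_2samp(projections[labels == a, k],
                            projections[labels == b, k])
            results.append({
                "pair": (a, b),
                "component": k,
                "statistic": float(test.statistic),
                "p_value": float(test.pvalue),
                "significant": bool(test.pvalue < threshold),
            })
    return results


# Synthetic stand-in for model features: 600 samples, 16 dimensions,
# with an artificial shift injected into one dimension for one subgroup.
rng = np.random.default_rng(0)
features = rng.normal(size=(600, 16))
groups = np.array(["female"] * 300 + ["male"] * 300)
features[groups == "male", 0] += 3.0

results = subgroup_shift_tests(features, groups)
for row in results:
    print(row)
```

In this synthetic setup the injected shift dominates the first component, so the test on that component flags a distribution difference between the two subgroups. With real model features, a significant result indicates only that the feature distributions differ between subgroups; linking that to performance disparities requires the separate classification analysis the abstract describes.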
