Developing robust and effective artificial intelligence (AI) models in medicine requires access to large amounts of patient data. Training AI models on large multi-institutional datasets can help meet this need, but data privacy must still be ensured, particularly because membership inference attacks risk breaching patient confidentiality. As a remedy, we advocate integrating differential privacy (DP). We specifically investigate how models trained with DP perform, compared with models trained without DP, on data from institutions that the model did not see during training (i.e., external validation), the situation that reflects the clinical use of AI models. Using more than 590,000 chest radiographs from five institutions, we evaluated the efficacy of DP-enhanced domain transfer (DP-DT) in diagnosing cardiomegaly, pleural effusion, pneumonia, and atelectasis, and in identifying healthy subjects. We compared DP-DT with non-DP-DT and examined diagnostic accuracy and demographic fairness, using the area under the receiver operating characteristic curve (AUC) as the main metric, together with accuracy, sensitivity, and specificity. Our results show that DP-DT, even at exceptionally high privacy levels (ε ≈ 1), performs comparably to non-DP-DT (P>0.119 across all domains). Furthermore, relative to non-DP-DT, DP-DT led to marginal AUC differences (less than 1%) for nearly all subgroups. Although consistent evidence suggests that DP training significantly degrades performance in on-domain applications, we show that off-domain performance is barely affected. We therefore strongly advocate adopting DP when training diagnostic medical AI models, given its minimal impact on performance.
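For readers unfamiliar with how DP enters model training, the following is a minimal sketch of one differentially private gradient step in the style of DP-SGD (per-sample gradient clipping followed by Gaussian noise). The abstract does not state which DP mechanism the authors used, so DP-SGD is an assumption here, and the function name `dp_sgd_step` and its parameters are hypothetical, chosen only for illustration.

```python
import numpy as np

def dp_sgd_step(weights, per_sample_grads, lr, clip_norm, noise_multiplier, rng):
    """One illustrative DP-SGD-style update (hypothetical helper, not the authors' code).

    per_sample_grads has shape (batch_size, n_params): one gradient row per sample.
    """
    # L2 norm of each sample's gradient.
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    # Shrink gradients whose norm exceeds clip_norm; smaller ones are left unchanged.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_sample_grads * scale
    # Gaussian noise with standard deviation proportional to the clipping bound.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=weights.shape)
    # Average the noisy sum over the batch and take a gradient step.
    noisy_mean = (clipped.sum(axis=0) + noise) / per_sample_grads.shape[0]
    return weights - lr * noisy_mean

# Usage: two samples, two parameters; the first gradient (norm 5) gets clipped to norm 1.
rng = np.random.default_rng(0)
grads = np.array([[3.0, 4.0], [0.3, 0.4]])
new_weights = dp_sgd_step(np.zeros(2), grads, lr=1.0, clip_norm=1.0,
                          noise_multiplier=1.1, rng=rng)
```

The privacy level ε that results from such training depends on the noise multiplier, the batch sampling rate, and the number of steps, and is normally computed by a privacy accountant, which this sketch omits.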