Private, fair and accurate: Training large-scale, privacy-preserving AI models in medical imaging

Artificial intelligence (AI) models are increasingly used in the medical domain. However, because medical data is highly sensitive, special precautions are required to protect it. The gold standard for privacy preservation is the introduction of differential privacy (DP) into model training. Prior work indicates that DP negatively affects model accuracy and fairness, which is unacceptable in medicine and represents a main barrier to the widespread use of privacy-preserving techniques. In this work, we evaluated the effect of privacy-preserving training of AI models on accuracy and fairness, compared to non-private training. For this, we used two datasets: (1) a large dataset (N=193,311) of high-quality clinical chest radiographs, and (2) a dataset (N=1,625) of 3D abdominal computed tomography (CT) images, with the task of classifying the presence of pancreatic ductal adenocarcinoma (PDAC). Both were retrospectively collected and manually labeled by experienced radiologists. We then compared non-private deep convolutional neural networks (CNNs) and privacy-preserving (DP) models with respect to privacy-utility trade-offs, measured as area under the receiver operating characteristic curve (AUROC), and privacy-fairness trade-offs, measured as Pearson's r or Statistical Parity Difference. We found that, while privacy-preserving training yielded lower accuracy, it largely did not amplify discrimination against age, sex or co-morbidity. Our study shows that, under the challenging, realistic circumstances of a real-life clinical dataset, the privacy-preserving training of diagnostic deep learning models is possible with excellent diagnostic accuracy and fairness.
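To make the three evaluation metrics named above concrete, the following is a minimal sketch of how AUROC, Statistical Parity Difference and Pearson's r could be computed for a binary classifier. It is not the study's evaluation code: the function names, the pairwise AUROC formulation, the boolean encoding of the demographic group and the toy arrays are all assumptions made for illustration.

```python
import numpy as np


def auroc(y_true, y_score):
    """AUROC as the probability that a random positive case receives a
    higher score than a random negative case (ties count as one half)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true]
    neg = y_score[~y_true]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("AUROC needs at least one positive and one negative case")
    # Pairwise comparison: simple but O(n_pos * n_neg) in memory.
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (pos.size * neg.size))


def statistical_parity_difference(y_pred, group):
    """Difference in positive-prediction rates between two groups:
    P(y_pred = 1 | group) - P(y_pred = 1 | not group).
    A value of 0 means both groups receive positive predictions equally often."""
    y_pred = np.asarray(y_pred, dtype=float)
    group = np.asarray(group, dtype=bool)
    if group.all() or not group.any():
        raise ValueError("both groups must be non-empty")
    return float(y_pred[group].mean() - y_pred[~group].mean())


def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D sequences."""
    return float(np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1])


# Toy data (hypothetical, for illustration only).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
predictions = [1, 0, 1, 1]
in_group = [True, True, False, False]

print(auroc(labels, scores))                                 # 0.75
print(statistical_parity_difference(predictions, in_group))  # -0.5
```

Comparing these quantities between a non-private model and a DP model trained on the same data is one way to read the privacy-utility and privacy-fairness trade-offs: a drop in AUROC reflects the utility cost, while a Statistical Parity Difference moving further from zero would indicate amplified disparity between groups.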
