Machine learning (ML) has demonstrated great potential in medical data analysis. Large datasets collected from diverse sources and settings are essential for ML models in healthcare to achieve better accuracy and generalizability. However, sharing data across healthcare institutions is challenging because privacy and regulatory requirements are complex and vary between institutions. It is therefore both difficult and crucial to let multiple parties collaboratively train an ML model on the private datasets each party holds, without sharing those datasets directly and without compromising their privacy through the collaboration itself. In this paper, we address this challenge by proposing Decentralized, Collaborative, and Privacy-preserving ML for Multi-Hospital Data (DeCaPH). It offers the following key benefits: (1) it allows different parties to collaboratively train an ML model without transferring their private datasets; (2) it safeguards patient privacy by limiting the potential privacy leakage arising from any contents shared across the parties during the training process; and (3) it facilitates ML model training without relying on a centralized server. We demonstrate the generalizability and power of DeCaPH on three distinct tasks using real-world distributed medical datasets: patient mortality prediction using electronic health records, cell-type classification using single-cell human genomes, and pathology identification using chest radiology images. We show that ML models trained with the DeCaPH framework achieve an improved utility-privacy trade-off: they perform well while preserving the privacy of the training data points. In addition, ML models trained with the DeCaPH framework generally outperform those trained solely on the private dataset of an individual party, showing that DeCaPH enhances model generalizability.