Machine learning (ML) has demonstrated great potential in medical data analysis. Large datasets collected from diverse sources and settings are essential for ML models in healthcare to achieve better accuracy and generalizability. However, sharing data across healthcare institutions is challenging because of complex and varying privacy and regulatory requirements. Hence, it is hard but crucial to allow multiple parties to collaboratively train an ML model on the private datasets available at each party, without directly sharing those datasets or compromising their privacy through the collaboration. In this paper, we address this challenge by proposing Decentralized, Collaborative, and Privacy-preserving ML for Multi-Hospital Data (DeCaPH). It offers the following key benefits: (1) it allows different parties to collaboratively train an ML model without transferring their private datasets; (2) it safeguards patient privacy by limiting the potential privacy leakage arising from any content shared across the parties during training; and (3) it facilitates ML model training without relying on a centralized server. We demonstrate the generalizability and power of DeCaPH on three distinct tasks using real-world distributed medical datasets: patient mortality prediction using electronic health records, cell-type classification using single-cell human genomes, and pathology identification using chest radiology images. We show that ML models trained with the DeCaPH framework have an improved utility-privacy trade-off: they achieve good performance while preserving the privacy of the training data points. In addition, ML models trained with the DeCaPH framework generally outperform those trained solely on the private dataset of an individual party, showing that DeCaPH enhances model generalizability.