Safeguarding personal information is paramount for healthcare data sharing, a challenging problem with no silver bullet thus far. We study the prospect of a recent deep-learning advance, dataset condensation (DC), for sharing healthcare data for AI research, and the results are promising. The condensed data abstracts the original records and irreversibly conceals individual-level knowledge to achieve bona fide de-identification, which permits free sharing. Moreover, the deep-learning utility of the original data is well preserved in the condensed data, with compressed volume and accelerated model convergence. On PhysioNet-2012, a condensed dataset of 20 samples can train deep models to attain 80.3% test AUC for mortality prediction (versus 85.8% with the 5120 original records), an inspiring finding that generalises to the MIMIC-III and Coswara datasets. We also interpret the inherent privacy protections of DC through theoretical analysis and empirical evidence. Dataset condensation opens a new gateway to sharing healthcare data for AI research, with multiple desirable traits.