Because of patient privacy concerns, machine learning research in healthcare has been slower and more limited than in other application domains. High-quality, realistic synthetic electronic health records (EHRs) can accelerate methodological development for research while mitigating the privacy concerns associated with data sharing. The current state-of-the-art models for synthetic EHR generation are generative adversarial networks, which are notoriously difficult to train and can suffer from mode collapse. Denoising Diffusion Probabilistic Models, a class of generative models inspired by statistical thermodynamics, have recently been shown to generate high-quality synthetic samples in certain domains. It is unknown whether they generalize to the generation of large-scale, high-dimensional EHRs. In this paper, we present a novel generative model based on diffusion models, the first successful application of diffusion models to electronic health records. Our model includes a mechanism for class-conditional sampling that preserves label information. We also introduce a new sampling strategy that accelerates inference. We show empirically that our model outperforms existing state-of-the-art synthetic EHR generation methods.