In recent years, deep learning has been successfully applied to a wide range of tasks involving electronic health records (EHRs), such as representation learning and clinical event prediction. However, privacy constraints limit access to EHR data, which has become a bottleneck for deep learning research. To mitigate this problem, generative adversarial networks (GANs) have been used to generate synthetic EHR data. Challenges remain in high-quality EHR generation, however, including generating time-series EHR data and handling imbalanced, uncommon diseases. In this work, we propose a Multi-label Time-series GAN (MTGAN) that generates EHR data while improving the quality of uncommon disease generation. The generator of MTGAN uses a gated recurrent unit (GRU) with a smooth conditional matrix to generate sequences and uncommon diseases. The critic scores samples using the Wasserstein distance to distinguish real samples from synthetic ones, considering both data and temporal features. We also propose a training strategy that calculates temporal features for real data and stabilizes GAN training. Furthermore, we design multiple statistical metrics and prediction tasks to evaluate the generated data. Experimental results demonstrate the quality of the synthetic data and the effectiveness of MTGAN in generating realistic sequential EHR data, especially for uncommon diseases.
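The critic described in the abstract scores samples under a Wasserstein objective. As a rough illustration only, the sketch below shows the standard Wasserstein GAN losses computed from critic scores. It is not MTGAN's actual loss: the abstract does not give the formula, and the temporal features, any regularization terms, and the function names `critic_loss` and `generator_loss` are all assumptions made for this example.

```python
from statistics import fmean
from typing import Sequence


def critic_loss(real_scores: Sequence[float], fake_scores: Sequence[float]) -> float:
    """Standard Wasserstein critic loss (hypothetical helper, not from the paper).

    Minimising this pushes scores for real samples up and scores for
    synthetic samples down.
    """
    return fmean(fake_scores) - fmean(real_scores)


def generator_loss(fake_scores: Sequence[float]) -> float:
    """Standard Wasserstein generator loss (hypothetical helper).

    Minimising this pushes the critic's scores for synthetic samples up.
    """
    return -fmean(fake_scores)


if __name__ == "__main__":
    # Toy critic scores, made up purely for illustration.
    real = [1.0, 2.0, 3.0]
    fake = [0.5, 1.0, 1.5]
    print("critic loss:", critic_loss(real, fake))
    print("generator loss:", generator_loss(fake))
```

In an actual model the scores would come from a neural critic applied to batches of real and generated visit sequences; plain lists are used here so that the arithmetic of the objective is easy to follow.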