Electronic health records (EHRs) have become the foundation of machine learning applications in healthcare, but the utility of real patient records is often limited by privacy and security concerns. Synthetic EHR generation offers a way to compensate for this limitation. Most existing methods synthesize new records from real EHR data without distinguishing the different types of events those records contain, so they cannot ensure that the generated event combinations are consistent with medical common sense. In this paper, we propose MSIC, a Multi-visit health Status Inference model for Collaborative EHR synthesis, to address these limitations. First, we formulate synthetic EHR generation as a probabilistic graphical model and tightly connect different types of events by modeling latent health states. Then, we derive a health state inference method tailored to the multi-visit scenario, which uses previous records to synthesize current and future records. Furthermore, we propose generating medical reports that add a textual description to each medical event, broadening the applications of synthesized EHR data. To generate the different paragraphs for each visit, we incorporate a multi-generator deliberation framework that coordinates message passing among multiple generators and employ a two-phase decoding strategy to produce high-quality reports. Extensive experiments on the widely used MIMIC-III and MIMIC-IV benchmarks demonstrate that MSIC advances the state of the art in synthetic data quality while maintaining low privacy risks.
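To make the idea of connecting event types through a latent health state concrete, the following is a minimal toy sketch, not the MSIC model itself. It assumes a two-state Markov chain over visits and samples one diagnosis, procedure, and medication per visit conditioned on the current state. The state names, probabilities, event vocabularies, and the function `sample_patient` are all invented for illustration.

```python
import random

# Hypothetical latent health states and toy transition probabilities.
STATES = ["stable", "acute"]
TRANSITION = {"stable": [0.8, 0.2], "acute": [0.4, 0.6]}

# Hypothetical event vocabularies: every event type depends on the same state.
EMISSION = {
    "diagnosis": {"stable": ["hypertension"], "acute": ["sepsis", "pneumonia"]},
    "procedure": {"stable": ["blood panel"], "acute": ["intubation", "central line"]},
    "medication": {"stable": ["lisinopril"], "acute": ["vancomycin", "norepinephrine"]},
}


def sample_patient(num_visits, seed=0):
    """Ancestral sampling: draw a state per visit, then one event of each type."""
    rng = random.Random(seed)
    state = rng.choice(STATES)  # uniform initial state (an assumption)
    visits = []
    for _ in range(num_visits):
        visit = {"state": state}
        for event_type, table in EMISSION.items():
            visit[event_type] = rng.choice(table[state])
        visits.append(visit)
        # The next visit's state depends only on the current one (Markov assumption).
        state = rng.choices(STATES, weights=TRANSITION[state])[0]
    return visits


if __name__ == "__main__":
    for visit in sample_patient(num_visits=3, seed=42):
        print(visit)
```

Because all three event types are drawn from the same per-visit state, the sampled combinations stay mutually consistent by construction, which is the kind of control the abstract says type-agnostic generators lack. A real system would learn the states and distributions from data and infer them from a patient's earlier visits rather than fixing them by hand.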