Synthetic Electronic Health Records (EHR) have emerged as a pivotal tool for advancing healthcare applications and machine learning models, particularly for researchers without direct access to healthcare data. Existing methods, such as rule-based approaches and generative adversarial networks (GANs), generate synthetic data that resembles real-world EHR data, but they typically produce it in a tabular format, which disregards temporal dependencies in patient histories and limits data replication. Recently, there has been growing interest in applying Generative Pre-trained Transformers (GPT) to EHR data, enabling applications such as disease progression analysis, population estimation, counterfactual reasoning, and synthetic data generation. In this work, we focus on synthetic data generation: we show that a GPT model can be trained on a patient representation derived from CEHR-BERT, and that the patient sequences it generates can be seamlessly converted to the Observational Medical Outcomes Partnership (OMOP) data format.