Synthetic electronic health records (EHRs) have emerged as a pivotal tool for advancing healthcare applications and machine learning models, particularly for researchers without direct access to healthcare data. Existing methods, such as rule-based approaches and generative adversarial networks (GANs), generate synthetic data that resembles real-world EHR data, but they often use a tabular format that disregards temporal dependencies in patient histories and limits data replication. Interest has recently grown in applying Generative Pre-trained Transformers (GPT) to EHR data, which enables applications such as disease progression analysis, population estimation, counterfactual reasoning, and synthetic data generation. In this work, we focus on synthetic data generation and demonstrate that a GPT model can be trained on a patient representation derived from CEHR-BERT, allowing us to generate patient sequences that can be seamlessly converted to the Observational Medical Outcomes Partnership (OMOP) data format.