Synthesize High-dimensional Longitudinal Electronic Health Records via Hierarchical Autoregressive Language Model

Synthetic electronic health records (EHRs) that are both realistic and privacy-preserving can serve as an alternative to real EHRs for machine learning (ML) modeling and statistical analysis. However, existing methods struggle to generate high-fidelity, granular EHR data in its original high-dimensional form because of the complexities inherent in high-dimensional data. In this paper, we propose the Hierarchical Autoregressive Language mOdel (HALO) for generating longitudinal, high-dimensional EHRs that preserve the statistical properties of real EHRs and can be used to train accurate ML models without privacy concerns. HALO, designed as a hierarchical autoregressive model, generates a probability density function over medical codes, clinical visits, and patient records, which allows it to produce realistic EHR data in its original, unaggregated form without variable selection or aggregation. The model also produces high-quality continuous variables in a longitudinal and probabilistic manner. In extensive experiments, we demonstrate that HALO generates high-fidelity EHR data, achieving an R² correlation above 0.9 with real EHR data on high-dimensional disease code probabilities (d > 10,000), disease co-occurrence probabilities within visits (d > 1,000,000), and conditional probabilities across consecutive visits (d > 5,000,000). This fidelity enables downstream ML models trained on the synthetic data to achieve accuracy comparable to models trained on real data (0.938 AUROC with HALO data vs. 0.943 with real data). Finally, combining real and synthetic data improves the accuracy of ML models beyond that achieved with real EHR data alone.
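
The fidelity figures above compare code-level statistics between real and synthetic records. The abstract does not say exactly how the R² value is computed, so the following is only a minimal sketch under stated assumptions: each patient is a list of visits, each visit is a list of code strings, "code probability" is taken to mean the fraction of visits containing a code, and R² is taken to be the coefficient of determination of the synthetic probabilities against the real ones. The function names `code_prevalence` and `r_squared` and the toy records are hypothetical and not from the paper.

```python
from collections import Counter


def code_prevalence(patients):
    """Return, for each code, the fraction of visits in which it appears.

    Assumed layout (not from the paper): patients -> visits -> code strings.
    """
    counts = Counter()
    n_visits = 0
    for visits in patients:
        for visit in visits:
            n_visits += 1
            counts.update(set(visit))  # count each code once per visit
    return {code: c / n_visits for code, c in counts.items()}


def r_squared(real, synth):
    """Coefficient of determination of synthetic vs. real probabilities.

    Codes missing from one side are treated as probability 0.
    """
    keys = sorted(set(real) | set(synth))
    y = [real.get(k, 0.0) for k in keys]
    y_hat = [synth.get(k, 0.0) for k in keys]
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


# Toy records, invented purely for illustration.
real_patients = [[["A", "B"], ["A"]], [["B", "C"]]]
synth_patients = [[["A"], ["A", "C"]], [["B"]]]

real_probs = code_prevalence(real_patients)
synth_probs = code_prevalence(synth_patients)
score = r_squared(real_probs, synth_probs)
```

The same comparison could be applied to within-visit code pairs or to code transitions between consecutive visits by changing what is counted; the much larger dimensionalities quoted in the abstract for those statistics are consistent with counting pairs rather than single codes.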
