Synthesize Extremely High-dimensional Longitudinal Electronic Health Records via Hierarchical Autoregressive Language Model

Synthetic electronic health records (EHRs) that are both realistic and privacy-preserving can serve as an alternative to real EHRs for machine learning (ML) modeling and statistical analysis. However, existing methods struggle to generate high-fidelity, granular EHR data in its original, high-dimensional form because of the complexities inherent in high-dimensional data. In this paper, we propose the Hierarchical Autoregressive Language mOdel (HALO) for generating longitudinal, high-dimensional EHRs that preserve the statistical properties of real EHRs and can be used to train accurate ML models without privacy concerns. Designed as a hierarchical autoregressive model, HALO generates a probability density function over medical codes, clinical visits, and patient records, allowing it to generate realistic EHR data in its original, unaggregated form without variable selection or aggregation. The model also produces high-quality continuous variables in a longitudinal and probabilistic manner. We conduct extensive experiments and demonstrate that HALO generates high-fidelity EHR data with high-dimensional disease code probabilities (d > 10,000), disease co-occurrence probabilities within visits (d > 1,000,000), and conditional probabilities across consecutive visits (d > 5,000,000), achieving R² correlations above 0.9 in comparison to real EHR data. This fidelity enables downstream ML models trained on HALO's synthetic data to achieve accuracy comparable to models trained on real data (0.938 AUROC with HALO data vs. 0.943 with real data). Finally, combining real and synthetic data improves the accuracy of ML models beyond that achieved with real EHR data alone.
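To make the fidelity comparison above concrete, the following is a minimal sketch of how per-code and within-visit co-occurrence probabilities could be computed and compared between a real and a synthetic cohort. It is not the authors' evaluation code. The data layout (a patient as a list of visits, a visit as a set of codes), the function names, and the use of the coefficient of determination for the reported R² are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

# Assumed (hypothetical) layout: a cohort is a list of patients,
# a patient is a list of visits, and a visit is a set of medical codes.


def code_probabilities(patients):
    """Fraction of visits in which each code appears."""
    counts = Counter()
    n_visits = 0
    for visits in patients:
        for visit in visits:
            counts.update(set(visit))
            n_visits += 1
    return {code: c / n_visits for code, c in counts.items()}


def cooccurrence_probabilities(patients):
    """Fraction of visits in which each unordered code pair appears together."""
    counts = Counter()
    n_visits = 0
    for visits in patients:
        for visit in visits:
            # Sort so that each pair has one canonical key.
            counts.update(combinations(sorted(set(visit)), 2))
            n_visits += 1
    return {pair: c / n_visits for pair, c in counts.items()}


def r_squared(real, synth):
    """Coefficient of determination of synthetic vs. real probabilities.

    Keys missing from one side are treated as probability 0.
    """
    keys = list(set(real) | set(synth))
    y = [real.get(k, 0.0) for k in keys]
    y_hat = [synth.get(k, 0.0) for k in keys]
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    if ss_tot == 0.0:
        raise ValueError("real probabilities have zero variance")
    return 1.0 - ss_res / ss_tot


# Toy cohorts with made-up codes, for illustration only.
real_cohort = [[{"A", "B"}, {"A"}], [{"B", "C"}]]
synth_cohort = [[{"A"}], [{"B", "C"}]]

real_p = code_probabilities(real_cohort)
synth_p = code_probabilities(synth_cohort)
print(r_squared(real_p, synth_p))
```

The same `r_squared` helper applies unchanged to the co-occurrence dictionaries. Conditional probabilities across consecutive visits could be handled analogously by counting code pairs drawn from adjacent visits.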
