Generating Synthetic Electronic Health Record (EHR) Data: A Review with Benchmarking

We conduct a scoping review of existing approaches for synthetic EHR data generation and benchmark major methods with a proposed open-source software package to offer recommendations for practitioners. For the scoping review, we search three academic databases. Methods are benchmarked on the open-source EHR datasets MIMIC-III and MIMIC-IV. Seven existing methods covering the major categories and two baseline methods are implemented and compared. Evaluation metrics cover data fidelity, downstream utility, privacy protection, and computational cost. We identify 42 studies and classify them into five categories. Seven open-source methods covering all categories are selected, trained on MIMIC-III, and evaluated on MIMIC-III or MIMIC-IV to assess transportability. Among them, GAN-based methods demonstrate competitive performance in fidelity and utility on MIMIC-III, while rule-based methods excel in privacy protection. Similar findings are observed on MIMIC-IV, except that GAN-based methods further outperform the baseline methods in preserving fidelity. A Python package, "SynthEHRella", is provided to integrate multiple generation approaches and evaluation metrics, enabling more streamlined exploration and evaluation of multiple methods. We find that the choice of method is governed by the relative importance of the evaluation metrics in the downstream use case, and we provide a decision tree to guide the choice among the benchmarked methods. According to the decision tree, GAN-based methods excel when distributional shifts exist between the training and testing populations. Otherwise, CorGAN and MedGAN are most suitable for association modeling and predictive modeling, respectively. Future research should prioritize enhancing the fidelity of synthetic data while controlling privacy exposure, as well as comprehensive benchmarking of longitudinal or conditional generation methods.

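The decision rules summarized in the abstract can be written down as a small lookup. The following is a minimal sketch only: it encodes just the three recommendations stated above, the function name `recommend_method` and the task labels `"association"` and `"prediction"` are hypothetical names chosen for illustration, and the full decision tree in the paper may contain additional branches (for example, for privacy or computational cost). It is not part of the SynthEHRella API.

```python
def recommend_method(distribution_shift: bool, task: str) -> str:
    """Return a method recommendation following the rules stated in the abstract.

    distribution_shift: True if the training and testing populations differ
        (e.g., trained on MIMIC-III, applied to MIMIC-IV).
    task: hypothetical label for the downstream use case,
        either "association" or "prediction".
    """
    # Rule 1: under distributional shift, the abstract recommends GAN-based methods.
    if distribution_shift:
        return "GAN-based"

    # Rule 2: otherwise, the recommendation depends on the downstream task.
    recommendations = {
        "association": "CorGAN",  # association modeling
        "prediction": "MedGAN",   # predictive modeling
    }
    if task not in recommendations:
        raise ValueError(
            f"Unknown task {task!r}; expected one of {sorted(recommendations)}"
        )
    return recommendations[task]


if __name__ == "__main__":
    print(recommend_method(distribution_shift=True, task="prediction"))
    print(recommend_method(distribution_shift=False, task="association"))
    print(recommend_method(distribution_shift=False, task="prediction"))
```

Under distributional shift the sketch returns the generic label "GAN-based" because the abstract does not single out a specific GAN method for that case.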