Recent advances in deep generative models have greatly expanded the potential to create realistic synthetic health datasets. These synthetic datasets aim to preserve the characteristics, patterns, and overall scientific conclusions derived from sensitive health datasets without disclosing patient identity or sensitive information. Synthetic data can thus facilitate safe data sharing that supports a range of initiatives, including the development of new predictive models, advanced health IT platforms, and general project ideation and hypothesis development. However, many questions and challenges remain, including how to consistently evaluate a synthetic dataset's similarity to the original real dataset, its predictive utility relative to that dataset, and the privacy risk it poses when shared. Regulatory and governance issues have also not been widely addressed. In this primer, we map the state of synthetic health data, including generation and evaluation methods and tools, existing examples of deployment, the regulatory and ethical landscape, access and governance options, and opportunities for further development.