A primer on synthetic health data

Recent advances in deep generative models have greatly expanded the potential to create realistic synthetic health datasets. These synthetic datasets aim to preserve the characteristics, patterns, and overall scientific conclusions derived from sensitive health datasets without disclosing patient identity or sensitive information. Synthetic data can therefore facilitate safe data sharing in support of a range of initiatives, including the development of new predictive models, advanced health IT platforms, and general project ideation and hypothesis development. However, many questions and challenges remain, including how to consistently evaluate a synthetic dataset's similarity to the original real dataset, its predictive utility relative to that dataset, and the privacy risk it poses when shared. Related regulatory and governance issues have also not been widely addressed. In this primer, we map the state of synthetic health data, including generation and evaluation methods and tools, existing examples of deployment, the regulatory and ethical landscape, access and governance options, and opportunities for further development.

