Synthetic data are becoming a critical tool for building artificially intelligent systems. Simulators provide a way of generating data systematically and at scale. These data can then be used either exclusively or in conjunction with real data for training and testing systems. Synthetic data are particularly attractive in cases where the availability of ``real'' training examples might be a bottleneck. While the volume of data in healthcare is growing exponentially, creating datasets that support novel tasks or that reflect a diverse set of conditions and causal relationships is not trivial. Furthermore, these data are highly sensitive and often patient-specific. Recent research has begun to illustrate the potential of synthetic data in many areas of medicine, but no systematic review of the literature exists. In this paper, we present the case for physical and statistical simulation as ways of creating data, along with their proposed applications in healthcare and medicine. We discuss how, while synthetic data can promote privacy, equity, safety, and continual and causal learning, they also run the risk of introducing flaws and blind spots and of propagating or exaggerating biases.