Data-driven technologies have improved the efficiency, reliability, and effectiveness of healthcare services, but they come with an increasing demand for data, which is hard to meet because of privacy-related constraints on sharing data in healthcare contexts. Synthetic data has recently gained popularity as a potential solution, but in the flurry of current research it can be hard to get an overview of its potential. This paper proposes a novel taxonomy of synthetic data in healthcare to navigate the landscape in terms of three main varieties. Data Proportion comprises the different ratios of synthetic data in a dataset and their associated pros and cons. Data Modality refers to the different data formats amenable to synthesis and their format-specific challenges. Data Transformation concerns using synthetic data to improve specific aspects of a dataset, such as its utility or privacy. Our taxonomy aims to help healthcare researchers interested in synthetic data grasp what types of datasets, data modalities, and transformations are possible with synthetic data, and where the challenges and overlaps between the varieties lie.