Recent advances in generative modelling have led many to see synthetic data as the go-to solution for a range of problems around data access, scarcity, and under-representation. In this paper, we study three prominent use cases: (1) sharing synthetic data as a proxy for proprietary datasets to enable statistical analyses while protecting privacy, (2) augmenting machine learning training sets with synthetic data to improve model performance, and (3) augmenting datasets with synthetic data to reduce variance in statistical estimation. For each use case, we formalise the problem setting and study, through formal analysis and case studies, the conditions under which synthetic data can achieve its intended objectives. We identify fundamental and practical limits that constrain when synthetic data can serve as an effective solution for a particular problem. Our analysis reveals that, because of these limits, many existing or envisioned use cases of synthetic data are a poor fit for the problem they target. Our formalisations and classification of synthetic data use cases enable decision makers to assess whether synthetic data is a suitable approach for their specific data availability problem.