Synthetic data is gaining traction as a cost-effective way to meet the growing data demands of AI development. It can be generated either from existing knowledge or derived from data captured from real-world events. Both the source used for generation and the technique applied significantly affect the residual privacy risk of synthetic data, and therefore how readily it can be shared. Traditional classifications of synthetic data types no longer fit newer generation techniques, and the classification needs to be better aligned with practical needs. We propose a new way of grouping synthetic data types that better supports privacy evaluations and thereby aids regulatory policymaking. Our classification is flexible enough to accommodate new advancements such as deep generative methods and offers a more practical framework for future applications.