Privacy-preserving synthetic data offers a promising way to harness segregated data in high-stakes domains where information is compartmentalized for regulatory, privacy, or institutional reasons. This survey provides a comprehensive framework for understanding the landscape of privacy-preserving synthetic data, presenting the theoretical foundations of generative models and differential privacy, followed by a review of state-of-the-art methods across tabular data, images, and text. Our synthesis of evaluation approaches highlights the fundamental trade-off between utility for downstream tasks and privacy guarantees, and identifies critical research gaps: the lack of realistic benchmarks representing specialized domains and a shortage of the empirical evaluations needed to contextualize formal guarantees. Through an empirical analysis of four leading methods on five real-world datasets from specialized domains, we demonstrate significant performance degradation under realistic privacy constraints ($\epsilon \leq 4$), revealing a substantial gap between results reported on general-domain benchmarks and performance on domain-specific data. These challenges underscore the need for robust evaluation frameworks, standardized benchmarks for specialized domains, and improved techniques that address the unique requirements of privacy-sensitive fields, so that this technology can deliver on its considerable potential.