SoK: Privacy-Preserving Data Synthesis

As data analysis becomes more prevalent, safeguarding data privacy has become a paramount concern. Consequently, there has been an upsurge in the development of mechanisms for privacy-preserving data analysis. However, these approaches are task-specific, and designing algorithms for new tasks is a cumbersome process. As an alternative, one can create synthetic data that is (ideally) devoid of private information. This paper focuses on privacy-preserving data synthesis (PPDS), providing a comprehensive overview, analysis, and discussion of the field. Specifically, we put forth a master recipe that unifies two prominent strands of research in PPDS: statistical methods and deep learning (DL)-based methods. Under the master recipe, we further dissect the statistical methods into choices of modeling and representation, and investigate the DL-based methods according to their generative modeling principles. To consolidate our findings, we provide comprehensive reference tables, distill key takeaways, and identify open problems in the existing literature. In doing so, we aim to answer the following questions: What are the design principles behind different PPDS methods? How can we categorize these methods, and what are the advantages and disadvantages of each category? Can we provide guidelines for method selection in different real-world scenarios? We then benchmark several prominent DL-based methods on the task of private image synthesis and conclude that DP-MERF is an all-purpose approach. Finally, having systematized the work of the past decade, we identify future directions and call on researchers to act.
