SoK: Privacy-Preserving Data Synthesis

As the prevalence of data analysis grows, safeguarding data privacy has become a paramount concern. Consequently, there has been an upsurge in the development of mechanisms for privacy-preserving data analysis. However, these approaches are task-specific; designing algorithms for new tasks is a cumbersome process. As an alternative, one can create synthetic data that is (ideally) devoid of private information. This paper focuses on privacy-preserving data synthesis (PPDS) by providing a comprehensive overview, analysis, and discussion of the field. Specifically, we put forth a master recipe that unifies two prominent strands of research in PPDS: statistical methods and deep learning (DL)-based methods. Under the master recipe, we further dissect the statistical methods into choices of modeling and representation, and investigate the DL-based methods according to their generative modeling principles. To consolidate our findings, we provide comprehensive reference tables, distill key takeaways, and identify open problems in the existing literature. In doing so, we aim to answer the following questions: What are the design principles behind different PPDS methods? How can we categorize these methods, and what are the advantages and disadvantages associated with each category? Can we provide guidelines for method selection in different real-world scenarios? We proceed to benchmark several prominent DL-based methods on the task of private image synthesis and conclude that DP-MERF is an all-purpose approach. Finally, having systematized the work of the past decade, we identify future directions and call for action from researchers.
