Tabular Data Synthesis with Differential Privacy: A Survey

Data sharing is a prerequisite for collaborative innovation, enabling organizations to leverage diverse datasets for deeper insights. In real-world applications like FinTech and Smart Manufacturing, transactional data, often in tabular form, are routinely generated and analyzed. However, such datasets typically contain sensitive personal or business information, raising privacy concerns and regulatory risks. Data synthesis tackles this by generating artificial datasets that preserve the statistical characteristics of real data while removing direct links to individuals. Even so, attackers can still infer sensitive information from synthetic data using background knowledge. Differential privacy offers a solution by providing provable and quantifiable privacy protection. Consequently, differentially private data synthesis has emerged as a promising approach to privacy-aware data sharing. This paper provides a comprehensive overview of existing differentially private tabular data synthesis methods, highlighting the challenges that each generative model faces when generating tabular data under differential privacy constraints. We classify the methods into statistical and deep learning-based approaches based on their generation models, discussing them in both centralized and distributed environments. We evaluate and compare these methods within each category, highlighting their strengths and weaknesses in terms of utility, privacy, and computational complexity. Additionally, we present and discuss various evaluation methods for assessing the quality of the synthesized data, identify research gaps in the field, and outline directions for future research.
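To make the idea of a statistical approach concrete, the following is a minimal sketch of one simple technique: a Laplace-noised histogram over a fixed, public domain, from which synthetic rows are sampled. It is an illustration only, not a method taken from the survey. The function names (`laplace_noise`, `dp_synthesize`), the toy columns, and the choice of add/remove-one-record neighboring datasets (which gives the histogram an L1 sensitivity of 1) are all assumptions made for this example.

```python
import itertools
import random
from collections import Counter


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is distributed as Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_synthesize(rows, domain, epsilon, n_synth, seed=0):
    """Sample synthetic rows from a Laplace-noised histogram (illustrative sketch).

    Assumptions: `domain` is public and fixed in advance (not derived from
    `rows`), `n_synth` does not depend on the private data, and neighboring
    datasets differ by adding or removing one record, so each record affects
    exactly one cell count by 1 (L1 sensitivity 1).
    """
    rng = random.Random(seed)
    counts = Counter(rows)
    # Noise is added to every cell of the domain, including empty ones.
    noisy = [
        max(0.0, counts.get(cell, 0) + laplace_noise(1.0 / epsilon, rng))
        for cell in domain
    ]
    # Clamping and sampling are post-processing and consume no extra budget.
    if sum(noisy) == 0:
        noisy = [1.0] * len(domain)  # fall back to uniform if all mass was clamped
    return rng.choices(domain, weights=noisy, k=n_synth)


# Toy example with two hypothetical categorical columns.
domain = list(itertools.product(["low", "high"], ["A", "B", "C"]))
rows = [("low", "A")] * 40 + [("high", "B")] * 25 + [("high", "C")] * 5
synthetic = dp_synthesize(rows, domain, epsilon=1.0, n_synth=50)
```

A full-domain histogram like this grows exponentially with the number of columns, so it only suits small tables; it is shown here for its simplicity, not its scalability.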

