Data sharing is a prerequisite for collaborative innovation, enabling organizations to leverage diverse datasets for deeper insights. In real-world applications such as FinTech and Smart Manufacturing, transactional data, often in tabular form, are generated and analyzed to extract insights. However, such datasets typically contain sensitive personal or business information, raising privacy concerns and regulatory risks. Data synthesis addresses this by generating artificial datasets that preserve the statistical characteristics of the real data while removing direct links to individuals. Even so, attackers with background knowledge can still infer sensitive information. Differential privacy addresses this remaining risk by providing provable and quantifiable privacy protection. Consequently, differentially private data synthesis has emerged as a promising approach to privacy-aware data sharing. This paper provides a comprehensive overview of existing differentially private tabular data synthesis methods, highlighting the unique challenges each generation model faces when generating tabular data under differential privacy constraints. We classify the methods into statistical and deep learning-based approaches according to their generation models, and discuss them in both centralized and distributed environments. We evaluate and compare the methods within each category, highlighting their strengths and weaknesses in terms of utility, privacy, and computational complexity. Additionally, we present and discuss evaluation methods for assessing the quality of synthesized data, identify research gaps in the field, and outline directions for future research.