Synthetic data from generative models is emerging as a privacy-preserving data-sharing solution. Such a synthetic dataset should resemble the original data without revealing identifiable private information. The backbone technology of tabular synthesizers is rooted in image generative models, ranging from Generative Adversarial Networks (GANs) to recent diffusion models. Prior work has shed light on the utility-privacy tradeoff on tabular data, revealing and quantifying the privacy risks of synthetic data. We first conduct an exhaustive empirical analysis of the utility-privacy tradeoff of five state-of-the-art tabular synthesizers against eight privacy attacks, with a special focus on membership inference attacks. Motivated by the observation that tabular diffusion delivers high data quality but also carries high privacy risk, we propose DP-TLDM, a Differentially Private Tabular Latent Diffusion Model, composed of an autoencoder network that encodes the tabular data and a latent diffusion model that synthesizes the latent tables. Following the emerging f-DP framework, we apply DP-SGD with batch clipping to train the autoencoder and use the separation value as the privacy metric to better capture the privacy gain from DP algorithms. Our empirical evaluation demonstrates that DP-TLDM achieves a meaningful theoretical privacy guarantee while also significantly enhancing the utility of synthetic data. Specifically, compared to other DP-protected tabular generative models, DP-TLDM improves synthetic data quality by an average of 35% in data resemblance, 15% in utility for downstream tasks, and 50% in data discriminability, all while preserving a comparable level of privacy risk.
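The abstract mentions training the autoencoder with DP-SGD in combination with batch clipping. As a rough illustration only, the sketch below shows one noisy gradient step in which the averaged batch gradient is clipped once, rather than each per-example gradient being clipped separately. This reading of "batch clipping" is an assumption, as are the function names `clip_by_l2_norm` and `batch_clipped_dp_step` and the parameters `clip_norm` and `noise_multiplier`; none of them come from the paper, and the noise scale used here does not by itself establish any particular privacy guarantee.

```python
import math
import random


def clip_by_l2_norm(vec, max_norm):
    """Scale vec down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm <= max_norm or norm == 0.0:
        return list(vec)
    scale = max_norm / norm
    return [v * scale for v in vec]


def batch_clipped_dp_step(params, per_example_grads, lr, clip_norm,
                          noise_multiplier, rng):
    """One noisy gradient step that clips the averaged batch gradient.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    n = len(per_example_grads)
    dim = len(params)
    # Average first, then clip once (assumed meaning of batch clipping).
    mean_grad = [sum(g[i] for g in per_example_grads) / n for i in range(dim)]
    clipped = clip_by_l2_norm(mean_grad, clip_norm)
    # Add Gaussian noise with standard deviation tied to the clipping bound.
    sigma = noise_multiplier * clip_norm
    noisy = [c + rng.gauss(0.0, sigma) for c in clipped]
    # Plain SGD update using the clipped, noised gradient.
    return [p - lr * g for p, g in zip(params, noisy)]


if __name__ == "__main__":
    rng = random.Random(0)
    grads = [[3.0, 4.0], [1.0, -2.0]]  # toy per-example gradients
    print(batch_clipped_dp_step([0.0, 0.0], grads, lr=0.1, clip_norm=1.0,
                                noise_multiplier=1.0, rng=rng))
```

Clipping bounds how much any one batch can move the parameters, and the added noise masks what remains of that influence. How the noise level translates into an f-DP guarantee or a separation value is not covered by this sketch.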