Quantifying and Mitigating Privacy Risks for Tabular Generative Models

Synthetic data from generative models is emerging as a privacy-preserving data-sharing solution. Such a synthetic data set should resemble the original data without revealing identifiable private information. The backbone technology of tabular synthesizers is rooted in image generative models, ranging from Generative Adversarial Networks (GANs) to recent diffusion models. Recent work sheds light on the utility-privacy tradeoff on tabular data, revealing and quantifying the privacy risks of synthetic data. We first conduct an exhaustive empirical analysis of the utility-privacy tradeoff of five state-of-the-art tabular synthesizers against eight privacy attacks, with a special focus on membership inference attacks. Motivated by the observation that tabular diffusion yields high data quality but also high privacy risk, we propose DP-TLDM, a Differentially Private Tabular Latent Diffusion Model, which is composed of an autoencoder network that encodes the tabular data and a latent diffusion model that synthesizes the latent tables. Following the emerging f-DP framework, we apply DP-SGD in combination with batch clipping to train the autoencoder, and we use the separation value as the privacy metric to better capture the privacy gain from DP algorithms. Our empirical evaluation demonstrates that DP-TLDM achieves a meaningful theoretical privacy guarantee while also significantly enhancing the utility of synthetic data. Specifically, compared to other DP-protected tabular generative models, DP-TLDM improves the synthetic quality by an average of 35% in data resemblance, 15% in the utility for downstream tasks, and 50% in data discriminability, all while preserving a comparable level of privacy risk.
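To make the training idea in the abstract concrete, the following is a minimal sketch of one DP-SGD update with batch clipping. It is not the authors' implementation. It assumes a toy linear least-squares model in place of the autoencoder, and the names `clip_by_norm`, `dp_sgd_batch_clipping_step`, `clip_norm`, and `noise_multiplier` are hypothetical. The noise scale used here is a simplification, and privacy accounting under f-DP is omitted entirely.

```python
import numpy as np


def clip_by_norm(vec, max_norm):
    """Scale vec down so that its L2 norm is at most max_norm."""
    norm = np.linalg.norm(vec)
    if norm > max_norm:
        return vec * (max_norm / norm)
    return vec


def dp_sgd_batch_clipping_step(weights, x_batch, y_batch, lr,
                               clip_norm, noise_multiplier, rng):
    """One noisy gradient step for a linear least-squares model (toy stand-in)."""
    # Gradient of the mean squared error, averaged over the whole batch.
    residual = x_batch @ weights - y_batch
    batch_grad = 2.0 * x_batch.T @ residual / len(y_batch)

    # Batch clipping: clip the aggregated gradient once, instead of
    # clipping every per-example gradient separately.
    clipped = clip_by_norm(batch_grad, clip_norm)

    # Gaussian noise with standard deviation tied to the clipping bound
    # (assumed scaling; a real system calibrates this via privacy accounting).
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=weights.shape)

    return weights - lr * (clipped + noise)


rng = np.random.default_rng(0)
x = rng.normal(size=(32, 4))
y = x @ np.array([1.0, -2.0, 0.5, 0.0])
w = np.zeros(4)
for _ in range(200):
    w = dp_sgd_batch_clipping_step(w, x, y, lr=0.05, clip_norm=1.0,
                                   noise_multiplier=0.1, rng=rng)
print(w)
```

Clipping the aggregated gradient requires only one norm computation per batch, which is the main practical difference from per-example clipping; how that choice affects the privacy guarantee depends on the accounting, which this sketch does not model.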

