We propose TVineSynth, a vine-copula-based synthetic tabular data generator designed to balance privacy and utility, using the vine tree structure and its truncation to control the trade-off. Unlike synthetic data generators that achieve differential privacy (DP) by globally adding noise, TVineSynth performs a controlled approximation of the estimated data-generating distribution, so that the resulting synthetic data does not suffer from poor utility in downstream prediction tasks. TVineSynth introduces a targeted bias into the vine copula model that, combined with the specific tree structure of the vine, causes the model to zero out privacy-leaking dependencies while relying on those that are beneficial for utility. Privacy is measured here with membership inference attacks (MIA) and attribute inference attacks (AIA). Further, we theoretically justify how the construction of TVineSynth ensures AIA privacy under a natural privacy measure for continuous sensitive attributes. When compared with competitor models, with and without DP, on simulated and real-world data, TVineSynth achieves a superior privacy-utility balance.