Recent advances in synthetic data generation (SDG) have been hailed as a solution to the difficult problem of sharing sensitive data while protecting privacy. SDG aims to learn the statistical properties of real data in order to generate "artificial" data that are structurally and statistically similar to the sensitive data. However, prior research suggests that inference attacks on synthetic data undermine privacy only for specific outlier records. In this work, we introduce a new attribute inference attack against synthetic data. The attack is based on linear reconstruction methods for aggregate statistics, which target all records in the dataset, not only outliers. We evaluate our attack on state-of-the-art SDG algorithms, including probabilistic graphical models, generative adversarial networks, and recent differentially private SDG mechanisms. By defining a formal privacy game, we show that our attack can be highly accurate even on arbitrary records, and that this accuracy results from individual-level information leakage rather than population-level inference. We then systematically evaluate the tradeoff between protecting privacy and preserving statistical utility. Our findings suggest that current SDG methods cannot consistently provide sufficient privacy protection against inference attacks while retaining reasonable utility. The best-performing method we evaluate, a differentially private SDG mechanism, can provide both protection against inference attacks and reasonable utility, but only in very specific settings. Lastly, we show that releasing a larger number of synthetic records can improve utility, but at the cost of making attacks far more effective.
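To illustrate the general idea behind linear reconstruction from aggregate statistics, the following is a minimal, self-contained sketch (not the paper's actual attack): an attacker observes noisy subset-sum queries over a hidden binary attribute, as might be estimated from released synthetic data, and recovers the per-record values via least squares. All names, the problem sizes, and the noise model here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: n records, each with one hidden binary attribute
# that the attacker tries to reconstruct from aggregate query answers.
n = 50
secret = rng.integers(0, 2, size=n)  # ground-truth sensitive bits

# The attacker observes m random subset-sum queries over the records.
# The noise stands in for the distortion introduced by the SDG step.
m = 200
A = rng.integers(0, 2, size=(m, n)).astype(float)  # query membership matrix
noise = rng.normal(0.0, 1.0, size=m)
b = A @ secret + noise                             # observed aggregate answers

# Linear reconstruction: least-squares solve, then round to {0, 1}.
x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
guess = (x_hat > 0.5).astype(int)

accuracy = (guess == secret).mean()
print(f"reconstruction accuracy: {accuracy:.2f}")
```

With many more queries than records and moderate noise, the rounded least-squares solution recovers most of the hidden bits, which is the core intuition behind reconstruction-style attacks on aggregate releases.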