Recent advances in synthetic data generation (SDG) have been hailed as a solution to the difficult problem of sharing sensitive data while protecting privacy. SDG aims to learn the statistical properties of real data in order to generate "artificial" data that are structurally and statistically similar to the sensitive data. However, prior research suggests that inference attacks on synthetic data can undermine privacy, although only for specific outlier records. In this work, we introduce a new attribute inference attack against synthetic data. The attack is based on linear reconstruction methods for aggregate statistics, which target all records in the dataset, not only outliers. We evaluate our attack on state-of-the-art SDG algorithms, including Probabilistic Graphical Models, Generative Adversarial Networks, and recent differentially private SDG mechanisms. By defining a formal privacy game, we show that our attack can be highly accurate even on arbitrary records, and that this accuracy results from individual information leakage rather than population-level inference. We then systematically evaluate the tradeoff between protecting privacy and preserving statistical utility. Our findings suggest that current SDG methods cannot consistently provide sufficient privacy protection against inference attacks while retaining reasonable utility. The best method evaluated, a differentially private SDG mechanism, can provide both protection against inference attacks and reasonable utility, but only in very specific settings. Lastly, we show that releasing a larger number of synthetic records can improve utility, but at the cost of making attacks far more effective.
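To give some intuition for why linear reconstruction from aggregate statistics can reach every record rather than only outliers, the following is a minimal toy sketch. It is not the attack evaluated in this work: it assumes a hypothetical binary secret attribute, random subset-count queries, Gaussian noise standing in for the imprecision of statistics read off a synthetic dataset, and an ordinary least-squares solver. The function name `simulate_linear_reconstruction` and all of its parameters are illustrative assumptions.

```python
import numpy as np


def simulate_linear_reconstruction(n_records=50, n_queries=400,
                                   noise_scale=1.0, seed=0):
    """Toy linear reconstruction of a secret binary attribute from noisy counts.

    All names and parameter values are hypothetical and chosen for illustration.
    Returns the fraction of records whose secret bit is guessed correctly.
    """
    rng = np.random.default_rng(seed)

    # Hypothetical secret attribute: one bit per record.
    secret = rng.integers(0, 2, size=n_records)

    # Each query sums the secret bits over a random subset of records.
    queries = rng.integers(0, 2, size=(n_queries, n_records)).astype(float)

    # Noisy answers stand in for aggregate statistics taken from synthetic data.
    noise = rng.normal(0.0, noise_scale, size=n_queries)
    answers = queries @ secret + noise

    # Solve the linear system in the least-squares sense, then round to bits.
    estimate, *_ = np.linalg.lstsq(queries, answers, rcond=None)
    guess = (estimate > 0.5).astype(int)

    return float(np.mean(guess == secret))


accuracy = simulate_linear_reconstruction()
print(f"fraction of records reconstructed: {accuracy:.2f}")
```

In this sketch, each record contributes to many of the aggregate queries, so the solver recovers an estimate for every record at once. Increasing `noise_scale` or reducing `n_queries` makes reconstruction harder, which loosely mirrors the privacy-utility tradeoff described above.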