Achilles' Heels: Vulnerable Record Identification in Synthetic Data Publishing

Synthetic data is seen as the most promising solution for sharing individual-level data while preserving privacy. Shadow modeling-based membership inference attacks (MIAs) have become the standard approach for evaluating the privacy risk of synthetic data. While very effective, they require a large number of datasets to be created and models to be trained to evaluate the risk posed by a single record. The privacy risk of a dataset is thus currently evaluated by running MIAs on a handful of records selected using ad-hoc methods. We here propose what is, to the best of our knowledge, the first principled vulnerable record identification technique for synthetic data publishing, leveraging the distance to a record's closest neighbors. We show that our method strongly outperforms previous ad-hoc methods across datasets and generators. We also present evidence that our method is robust to the choice of MIA and to the specific choice of parameters. Finally, we show that it accurately identifies vulnerable records when synthetic data generators are made differentially private. The choice of vulnerable records is as important as more accurate MIAs when evaluating the privacy of synthetic data releases, including from a legal perspective, and we propose a simple yet highly effective method for making that choice. We hope our method will enable practitioners to better estimate the risk posed by synthetic data publishing and researchers to fairly compare ever-improving MIAs on synthetic data.
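
The abstract describes ranking records by the distance to their closest neighbors but does not state the metric or the number of neighbors. As a minimal sketch of that idea, and not the authors' implementation, the Python below scores each record by its mean Euclidean distance to its k nearest neighbors and returns the most isolated ones. The function names `mean_knn_distance` and `most_vulnerable`, the Euclidean metric, and the default `k=5` are all assumptions made for illustration; the features are assumed to be numeric and already scaled.

```python
import numpy as np


def mean_knn_distance(X, k=5):
    """Mean Euclidean distance from each record to its k nearest neighbors.

    Assumption: X is an (n_records, n_features) array of numeric,
    pre-scaled features. The metric and k are illustrative choices.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of records")

    # Pairwise distances; O(n^2) memory, so only suitable for small datasets.
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))

    # A record must not count as its own neighbor.
    np.fill_diagonal(dist, np.inf)

    # Keep the k smallest distances per row and average them.
    nearest = np.sort(dist, axis=1)[:, :k]
    return nearest.mean(axis=1)


def most_vulnerable(X, k=5, n_records=10):
    """Indices of the n_records most isolated records, most isolated first."""
    scores = mean_knn_distance(X, k=k)
    return np.argsort(-scores)[:n_records]
```

Under these assumptions, the indices returned by `most_vulnerable` would be the candidates on which to run the more expensive shadow modeling-based MIAs. For large datasets, the dense pairwise distance matrix would need to be replaced by a nearest-neighbor index.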

