Synthetic data is seen as the most promising solution for sharing individual-level data while preserving privacy. Shadow modeling-based membership inference attacks (MIAs) have become the standard approach for evaluating the privacy risk of synthetic data. While very effective, they require creating a large number of datasets and training a large number of models to evaluate the risk posed by a single record. The privacy risk of a dataset is therefore currently evaluated by running MIAs on a handful of records selected using ad hoc methods. We here propose what is, to the best of our knowledge, the first principled technique for identifying vulnerable records in synthetic data publishing, leveraging the distance from a record to its closest neighbors. We show that our method strongly outperforms previous ad hoc methods across datasets and generators. We also present evidence that our method is robust to the choice of MIA and to the specific choice of parameters. Finally, we show that it accurately identifies vulnerable records when synthetic data generators are made differentially private. When evaluating the privacy of synthetic data releases, including from a legal perspective, the choice of vulnerable records is as important as developing more accurate MIAs, and our method offers a simple yet highly effective way to make that choice. We hope our method will enable practitioners to better estimate the risk posed by synthetic data publishing and researchers to fairly compare ever-improving MIAs on synthetic data.
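The idea of ranking records by the distance to their closest neighbors can be illustrated with a minimal sketch. The abstract does not specify the distance metric, the number of neighbors, or how the distances are aggregated, so everything below is an assumption made for illustration: the function name `rank_by_neighbor_distance` is hypothetical, records are assumed to be purely numerical, the metric is Euclidean, `k` defaults to 5, and the score is the mean distance to the `k` nearest neighbors. Records with the largest score (the most isolated ones) are treated as the most likely to be vulnerable.

```python
import numpy as np


def rank_by_neighbor_distance(records, k=5):
    """Rank records from most to least isolated.

    Assumptions (not taken from the abstract): Euclidean distance,
    numerical attributes only, and the mean over the k nearest
    neighbors as the per-record score.
    """
    X = np.asarray(records, dtype=float)
    n = X.shape[0]
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of records")

    # Pairwise Euclidean distances via broadcasting: shape (n, n).
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))

    # A record is not its own neighbor.
    np.fill_diagonal(dists, np.inf)

    # Mean distance to the k closest other records.
    nearest = np.sort(dists, axis=1)[:, :k]
    scores = nearest.mean(axis=1)

    # Largest score first: the most isolated records lead the ranking.
    ranking = np.argsort(-scores, kind="stable")
    return ranking, scores


# Toy data: a tight cluster of 50 records plus one far-away record.
rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.0, scale=1.0, size=(50, 3))
outlier = np.array([[10.0, 10.0, 10.0]])
data = np.vstack([cluster, outlier])

ranking, scores = rank_by_neighbor_distance(data, k=5)
candidates = ranking[:3]  # records one might prioritize for MIA evaluation
```

In this toy setup the far-away record (index 50) comes first in the ranking. In practice, the costly shadow modeling-based MIAs would then be run only on the top-ranked records rather than on an ad hoc selection. Mixed categorical and continuous attributes would need a different distance than the Euclidean one assumed here.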