The purpose of anonymizing structured data is to protect the privacy of the individuals it describes while retaining its statistical properties. A large body of work examines anonymization vulnerabilities. Focusing on strong anonymization mechanisms, this paper examines a number of prominent attack papers and finds several problems, all of which lead to overstating risk. First, some papers fail to establish a correct statistical inference baseline (or any at all), leading to incorrect measures. Notably, the reconstruction attack from the US Census Bureau that led to a redesign of its disclosure method made this mistake. We propose the non-member framework, an improved method for computing a more accurate inference baseline, and give examples of its operation. Second, some papers do not use a realistic membership base rate, so any precision values they report are incorrect. Third, some papers unnecessarily report measures in ways that make risk difficult or impossible to assess. Virtually the entire literature on membership inference attacks, dozens of papers, makes one or both of the latter two errors. We propose that membership inference papers report precision/recall values using a representative range of base rates.
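To illustrate why the membership base rate matters for precision, the following is a minimal sketch, not the paper's own method. It applies the standard Bayes relation between an attack's true positive rate, false positive rate, and the fraction of candidates that are actually members. The function name `precision_at_base_rate` and the operating point (TPR 0.10, FPR 0.001) are hypothetical values chosen only for illustration.

```python
def precision_at_base_rate(tpr: float, fpr: float, base_rate: float) -> float:
    """Precision (positive predictive value) of a membership inference attack.

    tpr       -- fraction of true members the attack flags (this is the recall)
    fpr       -- fraction of non-members the attack wrongly flags
    base_rate -- fraction of attacked candidates that really are members
    """
    true_pos = tpr * base_rate            # expected share of flagged members
    false_pos = fpr * (1.0 - base_rate)   # expected share of flagged non-members
    if true_pos + false_pos == 0:
        return float("nan")               # the attack flags nobody
    return true_pos / (true_pos + false_pos)


# Hypothetical operating point, assumed for illustration only.
TPR, FPR = 0.10, 0.001

# Report precision and recall over a range of base rates rather than
# only the balanced 50% setting.
for base_rate in (0.5, 0.1, 0.01, 0.001):
    precision = precision_at_base_rate(TPR, FPR, base_rate)
    print(f"base rate {base_rate:>6}: precision {precision:.3f}, recall {TPR:.3f}")
```

Under these assumed rates, precision is about 0.99 at a 50% base rate but falls to about 0.09 when only one candidate in a thousand is a member, while recall is unchanged. Precision measured at a single, unrealistically high base rate can therefore overstate risk.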