Epistemic Parity: Reproducibility as an Evaluation Metric for Differential Privacy

Lucas Rosenblatt, Bernease Herman, Anastasia Holovenko, Wonkwon Lee, Joshua Loftus, Elizabeth McKinnie, Taras Rumezhak, Andrii Stadnik, Bill Howe, Julia Stoyanovich

From arXiv. Preprint, 14 pages.

Differential privacy (DP) mechanisms are increasingly proposed to afford public release of sensitive information, offering strong theoretical guarantees for privacy, yet limited empirical evidence of utility. Utility is typically measured as the error on representative proxy tasks, such as descriptive statistics or performance over a query workload. Whether these results generalize to practitioners' experience has been questioned in a number of settings, including the U.S. Census. In this paper, we propose an evaluation methodology for synthetic data that avoids assumptions about the representativeness of proxy tasks, instead measuring the likelihood that published conclusions would change had the authors used synthetic data, a condition we call epistemic parity. We instantiate our methodology over a benchmark of recent peer-reviewed papers that analyze public datasets in the ICPSR social science repository. We model quantitative claims computationally to automate the experimental workflow, and model qualitative claims by reproducing visualizations and comparing the results manually. We then generate DP synthetic datasets using multiple state-of-the-art mechanisms, and estimate the likelihood that these conclusions will hold. We find that, for reasonable privacy regimes, state-of-the-art DP synthesizers are able to achieve high epistemic parity for several papers in our benchmark. However, some papers, and particularly some specific findings, are difficult to reproduce for any of the synthesizers. Given these results, we advocate for a new class of mechanisms that can reorder the priorities for DP data synthesis: favor stronger guarantees for utility (as measured by epistemic parity) and offer privacy protection with a focus on application-specific threat models and risk assessment.
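
As a rough illustration of the idea, epistemic parity for a single finding can be estimated as the fraction of synthetic datasets on which a conclusion that holds on the real data still holds. The sketch below is a minimal reading of the abstract, not the authors' implementation. The row schema, the example finding, and every function name are hypothetical, and placeholder_synthesizer is a bootstrap-plus-noise stand-in that is NOT differentially private; a real evaluation would substitute an actual DP synthesizer.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence, Tuple

# Hypothetical schema: (group label, outcome value).
Row = Tuple[int, float]


def finding_group1_higher(rows: Sequence[Row]) -> bool:
    """Hypothetical published claim: group 1 has a higher mean outcome than group 0."""
    g0 = [y for g, y in rows if g == 0]
    g1 = [y for g, y in rows if g == 1]
    if not g0 or not g1:
        return False  # the claim cannot be supported if a group is missing
    return mean(g1) > mean(g0)


def placeholder_synthesizer(
    rows: Sequence[Row], noise_scale: float, rng: random.Random
) -> List[Row]:
    """Stand-in synthesizer: bootstrap resample plus Gaussian noise.

    ASSUMPTION: this is only a placeholder and provides no DP guarantee.
    """
    sampled = rng.choices(rows, k=len(rows))
    return [(g, y + rng.gauss(0.0, noise_scale)) for g, y in sampled]


def epistemic_parity(
    rows: Sequence[Row],
    finding: Callable[[Sequence[Row]], bool],
    synthesize: Callable[[Sequence[Row], random.Random], List[Row]],
    n_draws: int = 200,
    seed: int = 0,
) -> float:
    """Fraction of synthetic draws on which the finding still holds."""
    if not finding(rows):
        raise ValueError("finding does not hold on the real data")
    rng = random.Random(seed)
    held = sum(finding(synthesize(rows, rng)) for _ in range(n_draws))
    return held / n_draws


if __name__ == "__main__":
    real = [(0, 0.0)] * 50 + [(1, 10.0)] * 50
    for scale in (0.1, 50.0, 1000.0):
        parity = epistemic_parity(
            real,
            finding_group1_higher,
            lambda r, rng, s=scale: placeholder_synthesizer(r, s, rng),
        )
        print(f"noise_scale={scale}: parity={parity:.2f}")
```

In this toy setup, larger noise plays the role of a stricter privacy regime: as the perturbation grows, the estimated parity drifts from 1 toward chance. The qualitative claims mentioned in the abstract (visualizations compared manually) are outside what this sketch covers.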
