Epistemic Parity: Reproducibility as an Evaluation Metric for Differential Privacy

Lucas Rosenblatt, Bernease Herman, Anastasia Holovenko, Wonkwon Lee, Joshua Loftus, Elizabeth McKinnie, Taras Rumezhak, Andrii Stadnik, Bill Howe, Julia Stoyanovich

From arXiv. Preprint, 14 pages.

Differentially private (DP) data synthesizers support the public release of sensitive information, offering theoretical guarantees for privacy but limited evidence of utility in practical settings. Utility is typically measured as the error on representative proxy tasks, such as descriptive statistics, the accuracy of trained classifiers, or performance over a query workload. Whether these results generalize to practitioners' experience has been questioned in a number of settings, including the U.S. Census. In this paper, we propose an evaluation methodology for synthetic data that avoids assumptions about the representativeness of proxy tasks, instead measuring the likelihood that published conclusions would change had the authors used synthetic data, a condition we call epistemic parity. Our methodology consists of reproducing the empirical conclusions of peer-reviewed papers on real, publicly available data, then re-running these experiments on DP synthetic data and comparing the results. We instantiate our methodology over a benchmark of recent peer-reviewed papers that analyze public datasets in the ICPSR repository. We model quantitative claims computationally to automate the experimental workflow, and model qualitative claims by reproducing visualizations and comparing the results manually. We then generate DP synthetic datasets using multiple state-of-the-art mechanisms and estimate the likelihood that these conclusions will hold. We find that state-of-the-art DP synthesizers achieve high epistemic parity for several papers in our benchmark. However, some papers, and particularly some specific findings, are difficult to reproduce with any of the synthesizers. We advocate for a new class of mechanisms that favor stronger utility guarantees and offer privacy protection with a focus on application-specific threat models and risk assessment.
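The comparison step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give a formula, so the code assumes that a quantitative finding can be modeled as a boolean predicate over a dataset, and that parity can be estimated as the fraction of synthetic datasets on which the predicate has the same truth value as on the real data. The names `epistemic_parity`, `treated_mean_is_higher`, and `noisy_copy` are hypothetical, and `noisy_copy` is only a stand-in for a synthesizer; it provides no differential privacy guarantee.

```python
import random
from statistics import mean
from typing import Callable, Dict, List, Sequence

Row = Dict[str, object]
Dataset = List[Row]
Finding = Callable[[Dataset], bool]


def epistemic_parity(finding: Finding, real: Dataset,
                     synthetic_datasets: Sequence[Dataset]) -> float:
    """Fraction of synthetic datasets on which `finding` has the same
    truth value as on the real data (an assumed estimator, see lead-in)."""
    if not synthetic_datasets:
        raise ValueError("need at least one synthetic dataset")
    real_outcome = finding(real)
    agree = sum(finding(s) == real_outcome for s in synthetic_datasets)
    return agree / len(synthetic_datasets)


def treated_mean_is_higher(data: Dataset) -> bool:
    """Hypothetical finding: the treated group has a higher mean outcome."""
    treated = [r["y"] for r in data if r["group"] == "treated"]
    control = [r["y"] for r in data if r["group"] == "control"]
    return mean(treated) > mean(control)


def noisy_copy(data: Dataset, scale: float, rng: random.Random) -> Dataset:
    """Stand-in for a synthesizer: adds Gaussian noise to each outcome.
    This is NOT differentially private; it only produces perturbed data."""
    return [{"group": r["group"], "y": r["y"] + rng.gauss(0.0, scale)}
            for r in data]


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy "real" data: treated outcomes are 1.0, control outcomes are 0.0.
    real = ([{"group": "treated", "y": 1.0} for _ in range(50)]
            + [{"group": "control", "y": 0.0} for _ in range(50)])
    synthetic = [noisy_copy(real, scale=2.0, rng=rng) for _ in range(200)]
    print(epistemic_parity(treated_mean_is_higher, real, synthetic))
```

Modeling each finding as a predicate keeps the workflow automatable: the same function runs unchanged on the real data and on every synthetic dataset, so only the data source varies between runs.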
