Many modern statistical analysis and machine learning applications require training models on sensitive user data. Differential privacy provides a formal guarantee that individual-level information about users does not leak. In this framework, randomized algorithms inject calibrated noise into the confidential data, producing privacy-protected datasets or query results. However, restricting analysts to privatized data makes it computationally challenging to draw valid inferences about the parameters underlying the confidential data. In this work, we propose simulation-based methods for inference from privacy-protected datasets. In addition to sequential Monte Carlo approximate Bayesian computation, we use neural conditional density estimators as a flexible family of distributions to approximate the posterior distribution of model parameters given the observed private query results. We illustrate our methods on discrete time-series data under an infectious disease model and on ordinary linear regression models. Illustrating the privacy-utility trade-off, our experiments and analysis demonstrate the necessity and feasibility of designing valid statistical inference procedures that correct for biases introduced by privacy-protection mechanisms.
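As a rough illustration of the idea, and not the method of this work, the sketch below privatizes a count of binary records with the Laplace mechanism and then runs plain rejection approximate Bayesian computation in which the simulator includes the privacy noise. Plain rejection sampling is swapped in here for the sequential Monte Carlo variant and the neural conditional density estimators named above, purely to keep the example short. The Bernoulli model, the uniform prior, the function names `laplace_count` and `rejection_abc`, and all numeric settings (`n`, `theta_true`, `epsilon`, `tol`, `num_draws`) are assumptions made for illustration.

```python
import numpy as np


def laplace_count(data, epsilon, rng):
    """Release a noisy count of ones (hypothetical helper).

    Changing one binary record changes the count by at most 1, so the
    sensitivity is 1 and Laplace noise with scale 1/epsilon is used.
    """
    return float(np.sum(data)) + rng.laplace(loc=0.0, scale=1.0 / epsilon)


def rejection_abc(s_obs, n, epsilon, rng, num_draws=200_000, tol=1.0):
    """Rejection ABC whose simulator includes the privacy mechanism.

    Assumed model: records are Bernoulli(theta) with a Uniform(0, 1) prior.
    """
    theta = rng.uniform(0.0, 1.0, size=num_draws)       # draws from the prior
    counts = rng.binomial(n, theta)                      # simulated confidential counts
    noisy = counts + rng.laplace(0.0, 1.0 / epsilon, size=num_draws)  # simulated releases
    keep = np.abs(noisy - s_obs) <= tol                  # accept if close to the observed release
    return theta[keep]


rng = np.random.default_rng(0)
n, theta_true, epsilon = 200, 0.3, 0.5                   # illustrative settings only

data = rng.binomial(1, theta_true, size=n)               # confidential data (never released)
s_obs = laplace_count(data, epsilon, rng)                # the only quantity the analyst sees
posterior = rejection_abc(s_obs, n, epsilon, rng)

print("accepted draws:", len(posterior))
print("posterior mean and sd:", posterior.mean(), posterior.std())
```

Because the Laplace noise is simulated alongside the data, the accepted draws reflect uncertainty from both sampling and privatization, whereas treating `s_obs` as if it were the exact count would understate the spread of the posterior.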