Many modern statistical analysis and machine learning applications require training models on sensitive user data. Differential privacy provides a formal guarantee that limits how much individual-level information about users can leak. In this framework, randomized algorithms inject calibrated noise into the confidential data, resulting in privacy-protected datasets or query results. However, restricting statistical analysis to the privatized data alone makes it computationally challenging to perform valid inference on the parameters underlying the confidential data. In this work, we propose simulation-based methods for inference from privacy-protected datasets. Specifically, we use neural conditional density estimators as a flexible family of distributions to approximate the posterior distribution of model parameters given the observed private query results. We illustrate our methods on discrete time-series data under an infectious disease model and on ordinary linear regression models. Illustrating the privacy-utility trade-off, our experiments and analysis demonstrate both the necessity and the feasibility of designing valid statistical inference procedures that correct for the biases introduced by privacy-protection mechanisms.
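The idea of inferring parameters from privatized query results can be illustrated with a minimal sketch. It is not the method proposed in this work: instead of a neural conditional density estimator it uses plain rejection sampling (approximate Bayesian computation), and the Gaussian data model, the uniform prior, the clipping bounds, and the function names `laplace_mechanism`, `private_mean`, and `rejection_sbi` are all assumptions made for illustration. The point it shows is that the simulator includes the privacy mechanism itself, so the approximate posterior accounts for both the clipping bias and the injected noise.

```python
import numpy as np


def laplace_mechanism(value, sensitivity, epsilon, rng):
    """Add Laplace noise with scale sensitivity / epsilon to a query result."""
    return value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)


def private_mean(data, lower, upper, epsilon, rng):
    """Clip records to [lower, upper], then release a noisy mean."""
    clipped = np.clip(data, lower, upper)
    # Replacing one record moves the clipped mean by at most this amount.
    sensitivity = (upper - lower) / len(data)
    return laplace_mechanism(clipped.mean(), sensitivity, epsilon, rng)


def rejection_sbi(observed, n, lower, upper, epsilon, rng,
                  n_sims=20000, tol=0.05):
    """Rejection-based approximate posterior for the mean of N(theta, 1) data.

    Assumed (hypothetical) setup: theta ~ Uniform(-3, 3) a priori.
    Each prior draw is pushed through the data model AND the privacy
    mechanism; draws whose private output lands within `tol` of the
    observed private output are kept.
    """
    thetas = rng.uniform(-3.0, 3.0, size=n_sims)
    # Simulate one confidential dataset of size n per prior draw.
    data = rng.normal(loc=thetas[:, None], scale=1.0, size=(n_sims, n))
    clipped_means = np.clip(data, lower, upper).mean(axis=1)
    # Apply the same Laplace noise the real release would use.
    noise = rng.laplace(loc=0.0, scale=(upper - lower) / (n * epsilon),
                        size=n_sims)
    simulated = clipped_means + noise
    return thetas[np.abs(simulated - observed) < tol]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    confidential = rng.normal(loc=1.0, scale=1.0, size=100)
    released = private_mean(confidential, -2.0, 2.0, epsilon=1.0, rng=rng)
    posterior = rejection_sbi(released, n=100, lower=-2.0, upper=2.0,
                              epsilon=1.0, rng=rng)
    print("released private mean:", released)
    print("accepted draws:", len(posterior))
    print("approximate posterior mean:", posterior.mean())
```

Treating the released value as if it were the true mean ignores the clipping and the added noise, whereas the accepted draws above come from a simulator that reproduces both. A neural conditional density estimator plays the same role as the accept/reject step, but learns the mapping from private outputs to posterior distributions rather than discarding most simulations.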