We develop both theory and algorithms to analyze privatized data under unbounded differential privacy (DP), where even the sample size is considered a sensitive quantity that requires privacy protection. We show that the distance between the sampling distributions under unbounded DP and bounded DP goes to zero as the sample size $n$ goes to infinity, provided that the noise used to privatize $n$ scales at an appropriate rate; we also establish that ABC-type posterior distributions converge under similar assumptions. We further give asymptotic results in the regime where the privacy budget for $n$ goes to zero, establishing similarity of the sampling distributions and showing that the MLE in the unbounded setting converges to the bounded-DP MLE. To facilitate valid, finite-sample Bayesian inference on privatized data in the unbounded DP setting, we propose a reversible jump MCMC algorithm that extends the data augmentation MCMC of Ju et al. (2022). We also propose a Monte Carlo EM algorithm to compute the MLE from privatized data under both bounded and unbounded DP. We apply our methodology to a linear regression model as well as to a 2019 American Time Use Survey Microdata File, which we model using a Dirichlet distribution.
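To make concrete what it means to privatize the sample size, here is a minimal sketch that releases a noisy count. It assumes the Laplace mechanism with sensitivity 1 (adding or removing one record changes $n$ by one); the abstract does not say which mechanism is actually used. The function name `privatize_sample_size` and the parameter `epsilon_n` (the privacy budget spent on $n$) are hypothetical names chosen for illustration.

```python
import random


def privatize_sample_size(n, epsilon_n, rng=None):
    """Return n plus Laplace(0, 1/epsilon_n) noise.

    Assumption: the Laplace mechanism with sensitivity 1 is used for the
    count; this is an illustrative choice, not taken from the source.
    """
    if epsilon_n <= 0:
        raise ValueError("epsilon_n must be positive")
    rng = rng or random.Random()
    scale = 1.0 / epsilon_n  # Laplace scale = sensitivity / budget
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return n + noise


if __name__ == "__main__":
    rng = random.Random(2019)
    # A smaller budget for n means a noisier released sample size.
    for eps in (1.0, 0.1, 0.01):
        print(eps, privatize_sample_size(1000, eps, rng))
```

As `epsilon_n` shrinks, the released value carries less information about the true $n$. This corresponds to the regime described above in which the privacy budget for $n$ goes to zero.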