Generating synthetic data, with or without differential privacy, has attracted significant attention as a potential solution to the tension between making data easily available and protecting the privacy of data subjects. Several works have shown that consistent downstream analysis of synthetic data, including accurate uncertainty estimation, requires accounting for the synthetic data generation process. Few methods do so, and most of them target frequentist analysis. In this paper, we study how to perform consistent Bayesian inference from synthetic data. We prove that the mixture of posterior samples obtained separately from multiple large synthetic data sets converges to the posterior of the downstream analysis under standard regularity conditions when the analyst's model is compatible with the data provider's model. We also present several examples showing how the theory works in practice, and how Bayesian inference can fail when the compatibility assumption is not met or the synthetic data set is not significantly larger than the original.
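The mixing procedure can be illustrated with a minimal sketch on a toy problem. The setting below is an assumption chosen for illustration, not one of the paper's examples: a Gaussian mean with known standard deviation and a flat prior, where the data provider and the analyst use the same model (so the compatibility assumption holds trivially). The function name `mixed_posterior_samples` and all parameter values are hypothetical.

```python
import numpy as np

def mixed_posterior_samples(x, sigma, n_syn_datasets, n_syn, draws_per_dataset, rng):
    """Pool analyst posterior draws across several synthetic data sets.

    Toy model (an assumption for illustration): x_j ~ N(mu, sigma^2) with
    known sigma and a flat prior on mu, so every posterior is Gaussian.
    """
    n = len(x)
    # Data provider: posterior of mu given the real data.
    provider_mean = x.mean()
    provider_sd = sigma / np.sqrt(n)

    pooled = []
    for _ in range(n_syn_datasets):
        # Draw one parameter value, then a synthetic data set from it
        # (i.e. a draw from the provider's posterior predictive).
        mu_i = rng.normal(provider_mean, provider_sd)
        x_syn = rng.normal(mu_i, sigma, size=n_syn)

        # Analyst: posterior of mu given this synthetic data set alone.
        draws = rng.normal(x_syn.mean(), sigma / np.sqrt(n_syn), size=draws_per_dataset)
        pooled.append(draws)

    # Mixing step: concatenate the per-data-set posterior draws.
    return np.concatenate(pooled)

rng = np.random.default_rng(0)
sigma = 2.0
x = rng.normal(1.0, sigma, size=50)  # stand-in for the real data

samples = mixed_posterior_samples(
    x, sigma, n_syn_datasets=1000, n_syn=5000, draws_per_dataset=50, rng=rng
)

# Reference: the posterior the analyst would get from the real data.
target_mean = x.mean()
target_sd = sigma / np.sqrt(len(x))
print("mixed   mean, sd:", samples.mean(), samples.std())
print("target  mean, sd:", target_mean, target_sd)
```

In this toy model the pooled draws have variance of roughly sigma^2/n + 2·sigma^2/n_syn, so the mixture approaches the real-data posterior only when each synthetic data set is much larger than the original. Setting `n_syn` equal to `len(x)` makes the mixed posterior visibly too wide, which mirrors the failure mode mentioned in the abstract.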