Generating synthetic data, with or without differential privacy, has attracted significant attention as a potential solution to the dilemma between making data easily available and protecting the privacy of data subjects. Several works have shown that consistency of downstream analyses from synthetic data, including accurate uncertainty estimation, requires accounting for the synthetic data generation process. Very few methods do so, and most of them target frequentist analysis. In this paper, we study how to perform consistent Bayesian inference from synthetic data. We prove that mixing posterior samples obtained separately from multiple large synthetic datasets converges to the posterior of the downstream analysis under standard regularity conditions when the analyst's model is compatible with the data provider's model. We show experimentally that this works in practice, unlocking consistent Bayesian inference from synthetic data while reusing existing downstream analysis methods.
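The mixing step can be illustrated with a toy sketch. This is not the paper's implementation: the conjugate Gaussian model with known variance, the prior, the dataset sizes, and the function names `gaussian_posterior` and `mixed_synthetic_posterior` are all assumptions chosen so that the exact real-data posterior is available for comparison, and no differential privacy is applied. The data provider and the analyst use the same model here, which is the simplest case of compatible models.

```python
import numpy as np


def gaussian_posterior(data, sigma=1.0, prior_mean=0.0, prior_sd=10.0):
    """Conjugate posterior (mean, sd) for the mean of N(mu, sigma^2) data."""
    precision = 1.0 / prior_sd**2 + len(data) / sigma**2
    mean = (prior_mean / prior_sd**2 + np.sum(data) / sigma**2) / precision
    return mean, np.sqrt(1.0 / precision)


def mixed_synthetic_posterior(real_data, n_datasets=400, n_syn=20_000,
                              draws_per_dataset=50, sigma=1.0, seed=0):
    """Pool analyst posterior draws computed separately on each synthetic dataset."""
    rng = np.random.default_rng(seed)
    # Data provider (assumed model): posterior of mu given the real data.
    prov_mean, prov_sd = gaussian_posterior(real_data, sigma)
    pooled = []
    for _ in range(n_datasets):
        # One parameter draw per synthetic dataset, then a large synthetic sample.
        mu_i = rng.normal(prov_mean, prov_sd)
        synthetic = rng.normal(mu_i, sigma, size=n_syn)
        # Analyst: ordinary posterior, treating the synthetic data as if it were real.
        a_mean, a_sd = gaussian_posterior(synthetic, sigma)
        pooled.append(rng.normal(a_mean, a_sd, size=draws_per_dataset))
    # Mixing: concatenate the per-dataset posterior samples.
    return np.concatenate(pooled)


rng = np.random.default_rng(1)
real_data = rng.normal(2.0, 1.0, size=50)  # hypothetical "real" data

target_mean, target_sd = gaussian_posterior(real_data)
pooled = mixed_synthetic_posterior(real_data)

print("real-data posterior:", target_mean, target_sd)
print("mixed synthetic posterior:", pooled.mean(), pooled.std())
```

In this sketch each synthetic dataset is much larger than the real one, so each per-dataset posterior concentrates near the parameter value that generated it, and the pooled draws approximately follow the provider's posterior given the real data. With small synthetic datasets the pooled sample would be visibly wider than the target.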