Safe and reliable disclosure of information from confidential data is a challenging statistical problem. A common approach is to generate synthetic data and disclose it in place of the original data. Efficient approaches must handle the trade-off between the reliability and the confidentiality of the released data. Ultimately, the aim is to reproduce statistical analyses of the original data as accurately as possible using the synthetic data. Bayesian networks are a model-based approach that can be used to parsimoniously estimate the underlying distribution of the original data and to generate synthetic datasets. These datasets ought not only to approximate the results of analyses of the original data but also to robustly quantify the uncertainty involved in the approximation. This paper proposes a fully Bayesian approach to generating and analyzing synthetic data based on the posterior predictive distribution of statistics of the synthetic data, allowing for efficient uncertainty quantification. The methodology exploits probabilistic properties of the model to devise a computationally efficient Monte Carlo algorithm for obtaining the target predictive distributions. Model parsimony is handled by proposing a general class of penalizing priors for Bayesian network models. Finally, the efficiency and applicability of the proposed methodology are empirically investigated through simulated and real examples.
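The idea of a posterior predictive distribution for a statistic of the synthetic data can be illustrated with a minimal sketch. It is not the algorithm proposed in the paper: it assumes a fixed two-node binary network A -> B, independent Beta(1, 1) priors on each conditional probability, and naive simulation of full synthetic datasets. It does not include structure learning, the penalizing priors, or the efficiency gains mentioned above. The function name, arguments, and toy data are hypothetical.

```python
import random


def posterior_predictive_statistic(data, n_synth, n_draws, seed=0):
    """Monte Carlo draws of the synthetic-data proportion of B = 1.

    data: list of (a, b) pairs with binary values, modeled as A -> B.
    Assumes Beta(1, 1) priors; an illustrative choice, not the paper's.
    """
    rng = random.Random(seed)

    # Sufficient statistics of the confidential data.
    n = len(data)
    n_a = [sum(1 for a, _ in data if a == k) for k in (0, 1)]
    n_b1 = [sum(b for a, b in data if a == k) for k in (0, 1)]

    draws = []
    for _ in range(n_draws):
        # 1. Draw the network parameters from their conjugate Beta posteriors.
        p_a = rng.betavariate(1 + n_a[1], 1 + n - n_a[1])
        p_b = [rng.betavariate(1 + n_b1[k], 1 + n_a[k] - n_b1[k]) for k in (0, 1)]

        # 2. Generate one synthetic dataset by ancestral sampling (A, then B given A).
        count_b1 = 0
        for _ in range(n_synth):
            a = int(rng.random() < p_a)
            count_b1 += int(rng.random() < p_b[a])

        # 3. Record the statistic of interest computed on the synthetic dataset.
        draws.append(count_b1 / n_synth)
    return draws


# Hypothetical toy data: B = 1 is more likely when A = 1.
toy = [(1, 1)] * 40 + [(1, 0)] * 10 + [(0, 1)] * 10 + [(0, 0)] * 40
draws = sorted(posterior_predictive_statistic(toy, n_synth=100, n_draws=500))
mean = sum(draws) / len(draws)
interval = (draws[12], draws[487])  # rough central 95% interval
```

The spread of `draws` reflects both parameter uncertainty (step 1) and the sampling variability of a finite synthetic dataset (step 2), which is the kind of uncertainty a predictive distribution is meant to quantify.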