Safe and reliable disclosure of information from confidential data is a challenging statistical problem. A common approach is to generate synthetic data and disclose them in place of the original data. Efficient approaches must balance the trade-off between the reliability and the confidentiality of the released data. Ultimately, the aim is to reproduce, as accurately as possible, statistical analyses of the original data using the synthetic data. Bayesian networks are a model-based approach that can be used to parsimoniously estimate the underlying distribution of the original data and to generate synthetic datasets. These datasets ought not only to approximate the results of analyses of the original data but also to robustly quantify the uncertainty involved in the approximation. This paper proposes a fully Bayesian approach to generating and analyzing synthetic data based on the posterior predictive distribution of statistics of the synthetic data, allowing for efficient uncertainty quantification. The methodology exploits probabilistic properties of the model to devise a computationally efficient algorithm that obtains the target predictive distributions via Monte Carlo. Model parsimony is handled by proposing a general class of penalizing priors for Bayesian network models. Finally, the efficiency and applicability of the proposed methodology are empirically investigated through simulated and real examples.
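To make the idea of a posterior predictive distribution of a synthetic-data statistic concrete, the following is a minimal sketch and not the algorithm proposed in the paper. It assumes a fixed two-node network A -> B over binary variables, uniform Beta(1, 1) priors on every conditional probability (no penalizing prior and no structure learning), and a toy dataset invented purely for illustration. The names `sample_beta_posterior` and `synthetic_statistic_draws` are hypothetical.

```python
import random


def sample_beta_posterior(successes, failures, rng, prior=1.0):
    """Draw a probability from the conjugate Beta(prior + successes, prior + failures) posterior."""
    return rng.betavariate(prior + successes, prior + failures)


def synthetic_statistic_draws(data, n_synth, n_draws, seed=0):
    """Monte Carlo draws of mean(B) in synthetic datasets from a fixed network A -> B.

    Assumption: the structure A -> B is known and both variables are binary.
    """
    rng = random.Random(seed)

    # Sufficient statistics (counts) of the original, confidential data.
    n_a1 = sum(a for a, _ in data)
    n_a0 = len(data) - n_a1
    n_given = {0: n_a0, 1: n_a1}
    n_b1_given = {
        0: sum(b for a, b in data if a == 0),
        1: sum(b for a, b in data if a == 1),
    }

    draws = []
    for _ in range(n_draws):
        # Step 1: draw the network parameters from their posterior.
        p_a = sample_beta_posterior(n_a1, n_a0, rng)
        p_b = {
            a: sample_beta_posterior(n_b1_given[a], n_given[a] - n_b1_given[a], rng)
            for a in (0, 1)
        }

        # Step 2: generate one synthetic dataset by ancestral sampling (A first, then B given A).
        total_b = 0
        for _ in range(n_synth):
            a = 1 if rng.random() < p_a else 0
            b = 1 if rng.random() < p_b[a] else 0
            total_b += b

        # Step 3: record the statistic of interest on the synthetic dataset.
        draws.append(total_b / n_synth)
    return draws


# Toy data, invented for illustration only: (A, B) pairs.
data = [(1, 1)] * 30 + [(1, 0)] * 10 + [(0, 1)] * 15 + [(0, 0)] * 45
draws = synthetic_statistic_draws(data, n_synth=200, n_draws=2000)

# Summarize the predictive distribution with its mean and a central 95% interval.
ordered = sorted(draws)
mean_b = sum(draws) / len(draws)
interval = (ordered[int(0.025 * len(ordered))], ordered[int(0.975 * len(ordered))])
print(mean_b, interval)
```

The spread of `draws` reflects both parameter uncertainty (step 1) and the sampling variability of a finite synthetic dataset (step 2), which is the kind of uncertainty the abstract says the released data should quantify.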