Mechanisms for generating differentially private synthetic data based on marginals and graphical models have been successful in a wide range of settings. However, one limitation of these methods is their inability to incorporate public data. Initializing a data-generating model by pre-training on public data has been shown to improve the quality of synthetic data, but this technique is not applicable when the model structure is not determined a priori. We develop the mechanism jam-pgm, which extends the adaptive measurements framework to jointly select between measuring public data and private data. This technique allows public data to be included in a graphical-model-based mechanism. We show that jam-pgm is able to outperform both publicly assisted and non-publicly assisted synthetic data generation mechanisms, even when the public data distribution is biased.
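To make the idea of jointly selecting between public and private measurements concrete, the following is a minimal illustrative sketch, not the jam-pgm algorithm itself. It assumes that each round uses the exponential mechanism to choose one (marginal, source) pair. The score functions, the helper names `exponential_mechanism` and `joint_select`, the dictionary layout, and the sensitivity bound of 2 are all assumptions made for this example and do not come from the abstract.

```python
import numpy as np

def exponential_mechanism(scores, epsilon, sensitivity, rng):
    """Sample an index with probability proportional to
    exp(epsilon * score / (2 * sensitivity))."""
    scores = np.asarray(scores, dtype=float)
    logits = epsilon * scores / (2.0 * sensitivity)
    logits -= logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(rng.choice(len(scores), p=probs))

def joint_select(private, public, model, epsilon, sigma, rng):
    """Pick one (marginal name, source) pair for the next measurement.

    private, public, model: dicts mapping a marginal name to a count
    vector; public counts are assumed to be pre-scaled to the private total.
    Hypothetical scores (assumptions for this sketch):
      private option: current model error minus the expected L1 error
                      of Gaussian noise with standard deviation sigma
      public option:  current model error minus the public marginal's
                      L1 distance from the private marginal
    """
    candidates, scores = [], []
    for name in private:
        model_err = np.abs(private[name] - model[name]).sum()
        # E|N(0, sigma^2)| = sigma * sqrt(2 / pi), summed over all cells
        noise_err = np.sqrt(2.0 / np.pi) * sigma * private[name].size
        public_err = np.abs(private[name] - public[name]).sum()
        candidates.append((name, "private"))
        scores.append(model_err - noise_err)
        candidates.append((name, "public"))
        scores.append(model_err - public_err)
    # Assumed sensitivity: each L1 term moves by at most 1 per record,
    # so a difference of two such terms moves by at most 2.
    idx = exponential_mechanism(scores, epsilon, sensitivity=2.0, rng=rng)
    return candidates[idx]
```

Under these assumed scores, a public marginal that closely matches the private one tends to be chosen over a noisy private measurement, while a heavily biased public marginal scores poorly and the private measurement tends to be chosen instead. This is the behaviour the abstract attributes to joint selection when public data is biased.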