Although several works succeed in generating synthetic data with differential privacy (DP) guarantees, they are inadequate for generating high-quality synthetic data when the input data has missing values. In this work, we formalize the problem of DP synthetic data generation with missing values and propose three effective adaptive strategies. These strategies significantly improve the utility of the synthetic data on four real-world datasets across different types and levels of missing data and different privacy requirements. We also identify the relationship between the privacy impact of these DP synthetic data generation algorithms on the complete ground-truth data and on the incomplete data. We model the missing-data mechanisms as a sampling process to obtain tighter upper bounds on the privacy guarantees for the ground-truth data. Overall, this study contributes to a better understanding of the challenges and opportunities of using private synthetic data generation algorithms in the presence of missing data.
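To illustrate why treating missingness as a sampling process can tighten a privacy bound, the sketch below applies the standard privacy-amplification-by-subsampling result for pure ε-DP. It is not the bound derived in this work. It assumes, purely for illustration, that each whole record is observed independently with probability `q`, regardless of its values, and that the generator is ε-DP with respect to the observed records. The function name `amplified_epsilon` and both parameters are hypothetical.

```python
import math


def amplified_epsilon(epsilon: float, q: float) -> float:
    """Standard amplification-by-subsampling bound for a pure eps-DP mechanism.

    Assumption (illustrative, not taken from the paper): each record of the
    ground-truth data is kept independently with probability q, and the
    mechanism run on the kept records is eps-DP. The composed procedure is
    then log(1 + q * (exp(eps) - 1))-DP with respect to the ground-truth data.
    """
    if epsilon < 0:
        raise ValueError("epsilon must be non-negative")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    # log1p and expm1 keep the computation accurate for small epsilon and q.
    return math.log1p(q * math.expm1(epsilon))


if __name__ == "__main__":
    # Compare the nominal epsilon with the amplified one at several
    # hypothetical observation rates.
    for q in (1.0, 0.8, 0.5, 0.2):
        print(f"q={q:.1f}  eps=1.0  amplified={amplified_epsilon(1.0, q):.4f}")
```

Under these assumptions the effective ε for the ground-truth data never exceeds the nominal ε and shrinks as more records go missing. Mechanisms in which missingness depends on the data values do not satisfy the independence assumption and need a separate analysis.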