Although several methods can generate synthetic data with differential privacy (DP) guarantees, they fail to produce high-quality synthetic data when the input data contains missing values. In this work, we formalize the problem of DP synthetic data generation with missing values and propose three adaptive strategies that significantly improve the utility of the synthetic data on four real-world datasets, across different types and levels of missingness and different privacy requirements. We also characterize, for these DP synthetic data generation algorithms, the relationship between the privacy impact on the complete ground truth data and the privacy impact on the incomplete data. By modeling the missingness mechanism as a sampling process, we obtain tighter upper bounds on the privacy guarantees for the ground truth data. Overall, this study contributes to a better understanding of the challenges and opportunities of using private synthetic data generation algorithms in the presence of missing data.
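To make the sampling view of missingness concrete, the sketch below applies the standard privacy-amplification-by-subsampling bound. It is an illustration under stated assumptions, not necessarily the bound derived in this work: it assumes that whole records go missing independently and completely at random, so that each ground truth record survives with probability `keep_prob`, and that the generator is pure ε-DP under add/remove adjacency on the incomplete data. The function name `amplified_epsilon` and its parameters are hypothetical.

```python
import math


def amplified_epsilon(epsilon: float, keep_prob: float) -> float:
    """Upper bound on the privacy loss w.r.t. the ground truth data.

    Assumption (not taken from the paper): each record is kept
    independently with probability `keep_prob` (Poisson subsampling),
    and the downstream mechanism is epsilon-DP on the kept records.
    The classical amplification result then gives
        epsilon' = log(1 + keep_prob * (exp(epsilon) - 1)).
    """
    if epsilon < 0:
        raise ValueError("epsilon must be non-negative")
    if not 0.0 <= keep_prob <= 1.0:
        raise ValueError("keep_prob must lie in [0, 1]")
    # log1p/expm1 keep the computation accurate for small epsilon.
    return math.log1p(keep_prob * math.expm1(epsilon))


if __name__ == "__main__":
    # Example: a 1.0-DP generator run on data where half the records are missing.
    print(amplified_epsilon(1.0, 0.5))
```

With `epsilon = 1.0` and `keep_prob = 0.5` the bound is roughly 0.62, smaller than the nominal 1.0. With no missingness (`keep_prob = 1.0`) it reduces to the original ε. Cell-level or value-dependent missingness would need a different analysis.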