Generative models trained with Differential Privacy (DP) are increasingly used to produce synthetic data while reducing privacy risks. However, their privacy-utility tradeoffs differ from model to model, which makes it hard to determine which models work best for a given setting or task. In this paper, we address this gap for tabular data by analyzing how DP generative models distribute their privacy budgets across rows and columns, arguably the main source of utility degradation. We examine the main factors that determine how privacy budgets are spent, including the underlying modeling techniques, DP mechanisms, and data dimensionality. Our extensive evaluation of both graphical and deep generative models sheds light on the distinctive features that make them suitable for different settings and tasks. We show that graphical models distribute the privacy budget horizontally and thus cannot handle relatively wide datasets, while their performance on the task they were optimized for increases monotonically with more data. Deep generative models spend their budget per iteration, so their behavior is less predictable as dataset dimensions vary, but they could perform better if trained on more features. Finally, low levels of privacy ($\epsilon\geq100$) could help some models generalize, achieving better results than without applying DP.
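The two budget-spending patterns described in the abstract can be illustrated with a toy calculation. This is a minimal sketch under simplifying assumptions, not the models evaluated in the paper: it reads "horizontally" as an even split of $\epsilon$ across columns, uses the Laplace mechanism with sensitivity 1 on count queries, and applies basic sequential composition. The function names are hypothetical, and in practice both model families rely on more refined allocation and tighter privacy accounting.

```python
def laplace_scale_per_column(epsilon, n_columns, sensitivity=1.0):
    """Laplace noise scale per column when epsilon is split evenly
    across columns (assumed basic sequential composition)."""
    return sensitivity * n_columns / epsilon


def relative_noise(epsilon, n_rows, n_columns):
    """Noise scale relative to the number of rows: it grows with the
    number of columns and shrinks as more rows become available."""
    return laplace_scale_per_column(epsilon, n_columns) / n_rows


def epsilon_per_iteration(epsilon, n_iterations):
    """Per-step budget when epsilon is split evenly over training
    iterations (assumed basic composition, a deliberate simplification)."""
    return epsilon / n_iterations


if __name__ == "__main__":
    # Wider data -> more noise per column for the same total epsilon.
    for d in (5, 50):
        print("columns:", d, "scale:", laplace_scale_per_column(1.0, d))
    # More rows -> the same noise matters less relative to the counts.
    for n in (1_000, 100_000):
        print("rows:", n, "relative noise:", relative_noise(1.0, n, 10))
    # Per-iteration spending does not depend on the number of rows or columns.
    print("per-iteration epsilon:", epsilon_per_iteration(1.0, 100))
```

Under these assumptions, the per-column noise depends directly on the dataset width, whereas the per-iteration budget is independent of the data dimensions. This is consistent with the abstract's observation that deep generative models behave less predictably as dimensions vary.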