Differentially private (DP) image synthesis aims to generate artificial images that retain the properties of sensitive images while protecting the privacy of individual images within the dataset. Despite recent advances, we find that inconsistent, and sometimes flawed, evaluation protocols have been applied across studies. This not only impedes the understanding of current methods but also hinders future progress. To address the issue, this paper introduces DPImageBench, a benchmark for DP image synthesis, with thoughtful design across several dimensions: (1) Methods. We study eleven prominent methods and systematically characterize each by model architecture, pretraining strategy, and privacy mechanism. (2) Evaluation. We include nine datasets and seven fidelity and utility metrics to assess these methods thoroughly. Notably, we find that the common practice of selecting downstream classifiers by their highest accuracy on the sensitive test set not only violates DP but also overestimates the utility scores. DPImageBench corrects these mistakes. (3) Platform. Beyond the methods and evaluation protocols, DPImageBench provides a standardized interface that accommodates current and future implementations within a unified framework. With DPImageBench, we obtain several noteworthy findings. For example, contrary to the common wisdom that pretraining on public image datasets is usually beneficial, we find that the distributional similarity between the pretraining and sensitive images significantly affects the performance of the synthetic images, and that pretraining does not always yield improvements. In addition, methods that add noise to low-dimensional features, such as the high-level characteristics of sensitive images, are less affected by the privacy budget than methods that add noise to high-dimensional features, such as weight gradients. The former perform better than the latter under a low privacy budget.
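The overestimation caused by selecting a classifier on the test set can be illustrated with a small simulation. The sketch below is not DPImageBench code; the function name, parameters, and the use of a held-out validation split are assumptions made for illustration, and the abstract does not specify which selection procedure DPImageBench adopts. Every candidate classifier is given the same true accuracy, so any gap between the two reported scores comes purely from the selection rule. The sketch covers only the overestimation effect, not the privacy violation.

```python
import random


def simulate_selection_bias(num_checkpoints=10, true_acc=0.8, test_size=200,
                            val_size=200, trials=300, seed=0):
    """Compare the reported test accuracy under two selection rules.

    All checkpoints share the same true accuracy (a simplifying assumption),
    so selection cannot find a genuinely better model.
    Returns (mean score when selecting on the test set,
             mean score when selecting on an independent validation set).
    """
    rng = random.Random(seed)

    def estimate(n):
        # Empirical accuracy measured on n i.i.d. examples.
        return sum(rng.random() < true_acc for _ in range(n)) / n

    picked_on_test = 0.0
    picked_on_val = 0.0
    for _ in range(trials):
        test_accs = [estimate(test_size) for _ in range(num_checkpoints)]
        val_accs = [estimate(val_size) for _ in range(num_checkpoints)]
        # Flawed protocol: report the best test accuracy among checkpoints.
        picked_on_test += max(test_accs)
        # Alternative: choose by validation accuracy, then report test accuracy.
        best_by_val = max(range(num_checkpoints), key=val_accs.__getitem__)
        picked_on_val += test_accs[best_by_val]
    return picked_on_test / trials, picked_on_val / trials


if __name__ == "__main__":
    on_test, on_val = simulate_selection_bias()
    print(f"selected on test set:       {on_test:.3f}")
    print(f"selected on validation set: {on_val:.3f}")
```

Under these assumptions, the score selected on the test set sits above the true accuracy because it is the maximum of several noisy estimates, whereas the score selected on an independent split stays close to it.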