Text-to-image diffusion models have been shown to suffer from sample-level memorization: they can reproduce near-perfect replicas of their training images, which may be undesirable. To remedy this issue, we develop the first differentially private (DP) retrieval-augmented generation algorithm that generates high-quality image samples while providing provable privacy guarantees. Specifically, we assume access to a text-to-image diffusion model trained on a small amount of public data, and design a DP retrieval mechanism that augments the text prompt with samples retrieved from a private retrieval dataset. Our \emph{differentially private retrieval-augmented diffusion model} (DP-RDM) requires no fine-tuning on the retrieval dataset to adapt to another domain, and can use state-of-the-art generative models to generate high-quality image samples while satisfying rigorous DP guarantees. For instance, when evaluated on MS-COCO, DP-RDM can generate samples with a privacy budget of $\epsilon=10$ while providing a $3.5$ point improvement in FID over public-only retrieval for up to $10{,}000$ queries.
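As a rough illustration of what a DP retrieval step could look like, the sketch below retrieves the $k$ nearest private embeddings for a query, averages them, and adds Gaussian noise scaled to the sensitivity of that average. It is not the algorithm of this paper, whose details the abstract does not give. The function name `noisy_retrieval_embedding`, the parameter `noise_multiplier`, and the sensitivity bound $2/k$ are assumptions made for illustration. The bound assumes unit-norm embeddings and that replacing one private record changes at most one retrieved neighbor. Converting the noise level into an $\epsilon$ over many queries requires a privacy accountant, which is not shown.

```python
import math
import random


def unit(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def noisy_retrieval_embedding(query, private_embeddings, k, noise_multiplier, rng):
    """Hypothetical helper: noisy mean of the k nearest private embeddings.

    Assumption: after normalization every embedding has norm 1, and replacing
    one private record changes at most one of the k retrieved neighbors, so
    the L2 sensitivity of the mean is at most 2 / k.
    """
    if not 1 <= k <= len(private_embeddings):
        raise ValueError("k must be between 1 and the number of private embeddings")

    q = unit(query)
    bank = [unit(e) for e in private_embeddings]

    # Rank private embeddings by cosine similarity to the query, best first.
    ranked = sorted(bank, key=lambda e: -sum(a * b for a, b in zip(e, q)))
    top_k = ranked[:k]

    # Aggregate the retrieved neighbors into a single conditioning vector.
    dim = len(q)
    mean = [sum(e[i] for e in top_k) / k for i in range(dim)]

    # Gaussian mechanism: noise standard deviation proportional to sensitivity.
    sigma = noise_multiplier * 2.0 / k
    return [m + rng.gauss(0.0, sigma) for m in mean]


if __name__ == "__main__":
    rng = random.Random(0)
    bank = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(50)]
    query = [rng.gauss(0, 1) for _ in range(8)]
    print(noisy_retrieval_embedding(query, bank, k=5, noise_multiplier=1.0, rng=rng))
```

In this sketch, a larger `k` shrinks the noise relative to the aggregated signal, at the cost of averaging over less relevant neighbors. The noisy vector would then be passed to the generator as extra conditioning alongside the text prompt.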