Text-to-image diffusion models have been shown to suffer from sample-level memorization, possibly reproducing near-perfect replicas of images they were trained on, which may be undesirable. To remedy this issue, we develop the first differentially private (DP) retrieval-augmented generation algorithm that is capable of generating high-quality image samples while providing provable privacy guarantees. Specifically, we assume access to a text-to-image diffusion model trained on a small amount of public data, and design a DP retrieval mechanism that augments the text prompt with samples retrieved from a private retrieval dataset. Our \emph{differentially private retrieval-augmented diffusion model} (DP-RDM) requires no fine-tuning on the retrieval dataset to adapt to a new domain, and can leverage state-of-the-art generative models to produce high-quality image samples while satisfying rigorous DP guarantees. For instance, when evaluated on MS-COCO, DP-RDM generates samples with a privacy budget of $\epsilon=10$ while providing a $3.5$-point improvement in FID over public-only retrieval for up to $10,000$ queries.
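To make the DP retrieval step concrete, the following is a minimal sketch of one way a differentially private retrieval mechanism could be structured: clip the private embeddings to bound per-sample sensitivity, retrieve the top-$k$ nearest neighbors, and release a noisy mean via the Gaussian mechanism. All function names, the clipping scheme, and the noise calibration here are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def dp_retrieve(query_emb, private_embs, k=8, noise_scale=0.1,
                clip_norm=1.0, rng=None):
    """Illustrative DP retrieval: noisy mean of top-k private embeddings.

    NOTE: a simplified sketch, not the DP-RDM mechanism itself. The noise
    calibration (noise_scale * clip_norm / k) reflects that clipping bounds
    each sample's contribution to the k-way mean.
    """
    rng = np.random.default_rng(0) if rng is None else rng

    # Clip each private embedding to norm <= clip_norm (bounds sensitivity).
    norms = np.linalg.norm(private_embs, axis=1, keepdims=True)
    clipped = private_embs * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))

    # Inner-product top-k retrieval against the normalized query.
    q = query_emb / np.linalg.norm(query_emb)
    sims = clipped @ q
    topk = np.argsort(sims)[-k:]

    # Gaussian mechanism: add noise to the mean of the retrieved embeddings.
    agg = clipped[topk].mean(axis=0)
    noisy = agg + rng.normal(0.0, noise_scale * clip_norm / k, size=agg.shape)
    return noisy
```

The noisy aggregate would then condition the public diffusion model in place of the raw retrieved samples, so no individual private image is exposed directly; the actual paper's privacy accounting over repeated queries is more involved than this single-query sketch.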