Diffusion-based text-to-image models are trained on large datasets scraped from the Internet, which may contain unacceptable concepts (e.g., copyright-infringing or unsafe content). We need concept removal techniques (CRTs) that are i) effective in preventing the generation of images with unacceptable concepts, ii) utility-preserving on acceptable concepts, and iii) robust against evasion with adversarial prompts. No prior CRT satisfies all three requirements simultaneously. We introduce Espresso, the first robust concept filter based on Contrastive Language-Image Pre-Training (CLIP). Espresso identifies unacceptable concepts by using the distance between the embedding of a generated image and the text embeddings of both unacceptable and acceptable concepts. This lets us fine-tune for robustness by separating the text embeddings of unacceptable and acceptable concepts while preserving utility. We present a pipeline to evaluate various CRTs and use it to show that Espresso is more effective and robust than prior CRTs, while retaining utility.
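The filtering idea described in the abstract, comparing a generated image's embedding against the text embeddings of an unacceptable and an acceptable concept, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of cosine similarity with a two-way softmax, the `temperature` value, and the `threshold` are all assumptions made for illustration, and the short vectors stand in for embeddings that would in practice come from CLIP's image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def unacceptable_score(image_emb, unacceptable_emb, acceptable_emb,
                       temperature=0.01):
    """Score in [0, 1]: how much closer the image is to the unacceptable
    concept than to the acceptable one.

    Assumption: a two-way softmax over temperature-scaled cosine
    similarities; the abstract only says that distances to both text
    embeddings are used.
    """
    s_u = cosine(image_emb, unacceptable_emb) / temperature
    s_a = cosine(image_emb, acceptable_emb) / temperature
    m = max(s_u, s_a)  # subtract the max for numerical stability
    e_u = math.exp(s_u - m)
    e_a = math.exp(s_a - m)
    return e_u / (e_u + e_a)


def filter_image(image_emb, unacceptable_emb, acceptable_emb, threshold=0.5):
    """Return True if the image should be blocked (hypothetical threshold)."""
    score = unacceptable_score(image_emb, unacceptable_emb, acceptable_emb)
    return score >= threshold


# Toy 3-dimensional stand-ins for CLIP embeddings (hypothetical values).
UNACCEPTABLE_TEXT = [1.0, 0.0, 0.0]
ACCEPTABLE_TEXT = [0.0, 1.0, 0.0]

blocked = filter_image([0.9, 0.1, 0.0], UNACCEPTABLE_TEXT, ACCEPTABLE_TEXT)
allowed = filter_image([0.1, 0.9, 0.0], UNACCEPTABLE_TEXT, ACCEPTABLE_TEXT)
```

Under this reading, fine-tuning that pushes the two text embeddings further apart widens the gap between the two similarities, which is one plausible way to understand why separating them could make the decision harder to flip with adversarial prompts.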