Diffusion-based text-to-image models are trained on large datasets scraped from the Internet, which may contain unacceptable concepts (e.g., copyright-infringing or unsafe content). We need concept removal techniques (CRTs) that are effective in preventing the generation of images with unacceptable concepts, utility-preserving on acceptable concepts, and robust against evasion with adversarial prompts. No prior CRT satisfies all of these requirements simultaneously. We introduce Espresso, the first robust concept filter based on Contrastive Language-Image Pre-Training (CLIP). We configure CLIP to identify unacceptable concepts in generated images using the distance of their embeddings to the text embeddings of both unacceptable and acceptable concepts. This lets us fine-tune for robustness by separating the text embeddings of unacceptable and acceptable concepts while preserving their pairing with image embeddings for utility. We present a pipeline to evaluate various CRTs and attacks against them, and show that Espresso is more effective and robust than prior CRTs while retaining utility.
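As an illustration of the filtering rule described above, the following is a minimal sketch, not the authors' implementation. It assumes cosine distance as the distance measure and uses small hand-written vectors in place of real CLIP embeddings; the function names `cosine_distance` and `is_unacceptable` are hypothetical. An image is flagged when its embedding lies closer to the text embedding of the unacceptable concept than to that of the acceptable concept.

```python
import numpy as np


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def is_unacceptable(image_emb, unacceptable_text_emb, acceptable_text_emb):
    """Flag an image whose embedding is closer to the unacceptable concept.

    Assumption: cosine distance is used here for illustration only.
    """
    d_unacceptable = cosine_distance(image_emb, unacceptable_text_emb)
    d_acceptable = cosine_distance(image_emb, acceptable_text_emb)
    return d_unacceptable < d_acceptable


# Toy stand-ins for CLIP embeddings (hypothetical values, not model outputs).
unacceptable = np.array([1.0, 0.0, 0.0])
acceptable = np.array([0.0, 1.0, 0.0])
image_near_unacceptable = np.array([0.9, 0.1, 0.0])
image_near_acceptable = np.array([0.1, 0.9, 0.2])

flag_a = is_unacceptable(image_near_unacceptable, unacceptable, acceptable)
flag_b = is_unacceptable(image_near_acceptable, unacceptable, acceptable)
```

Under this rule, pushing the two text embeddings further apart widens the margin between the two distances, which is one way to read the fine-tuning objective stated in the abstract.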