Diffusion-based text-to-image (T2I) models generate high-fidelity images from textual prompts. They are trained on large datasets scraped from the Internet, which may contain unacceptable concepts (e.g., copyright-infringing or unsafe content). Retraining T2I models after filtering unacceptable concepts out of the training data is inefficient and degrades utility. Hence, there is a need for concept removal techniques (CRTs) that are effective in removing unacceptable concepts, utility-preserving on acceptable concepts, and robust against evasion with adversarial prompts. None of the prior filtering and fine-tuning CRTs satisfy all these requirements simultaneously. We introduce Espresso, the first robust concept filter based on Contrastive Language-Image Pre-Training (CLIP). It identifies unacceptable concepts by projecting the generated image's embedding onto the vector connecting unacceptable and acceptable concepts in the joint text-image embedding space. This ensures robustness by restricting the adversary to adding noise only along this vector, in the direction of the acceptable concept. Further fine-tuning Espresso to separate the embeddings of acceptable and unacceptable concepts, while preserving their pairing with image embeddings, ensures both effectiveness and utility. We evaluate Espresso on eleven concepts to show that it is effective (~5% CLIP accuracy on unacceptable concepts), utility-preserving (~93% normalized CLIP score on acceptable concepts), and robust (~4% CLIP accuracy on adversarial prompts for unacceptable concepts). Finally, we present theoretical bounds for the certified robustness of Espresso against adversarial prompts, and an empirical analysis.
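The projection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`projection_score`, `is_unacceptable`), the 0.5 threshold, and the toy three-dimensional vectors are assumptions made for illustration, and a real filter would obtain the embeddings from a CLIP image encoder and text encoder.

```python
import numpy as np


def normalize(v):
    """Scale a vector to unit length, as is customary for CLIP embeddings."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)


def projection_score(image_emb, unacceptable_emb, acceptable_emb):
    """Position of the image embedding along the acceptable -> unacceptable axis.

    A score of 0.0 corresponds to the acceptable concept embedding and 1.0 to
    the unacceptable one (hypothetical scoring convention for this sketch).
    """
    img = normalize(image_emb)
    u = normalize(unacceptable_emb)
    a = normalize(acceptable_emb)
    axis = u - a  # vector connecting the two concept embeddings
    return float(np.dot(img - a, axis) / np.dot(axis, axis))


def is_unacceptable(image_emb, unacceptable_emb, acceptable_emb, threshold=0.5):
    """Flag the image if its projection lies closer to the unacceptable concept."""
    score = projection_score(image_emb, unacceptable_emb, acceptable_emb)
    return score > threshold


# Toy stand-ins for text embeddings of the two concepts (assumed values).
unacceptable = [1.0, 0.0, 0.0]
acceptable = [0.0, 1.0, 0.0]

# Toy stand-ins for embeddings of two generated images (assumed values).
image_near_unacceptable = [0.9, 0.1, 0.3]
image_near_acceptable = [0.1, 0.9, 0.3]

flag_first = is_unacceptable(image_near_unacceptable, unacceptable, acceptable)
flag_second = is_unacceptable(image_near_acceptable, unacceptable, acceptable)
```

For unit-length embeddings, a threshold of 0.5 is equivalent to asking whether the image has higher cosine similarity with the unacceptable concept than with the acceptable one. Any component of the image embedding orthogonal to the connecting vector leaves the score unchanged, which matches the intuition that only movement along this vector can change the filter's decision.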