Espresso: Robust Concept Filtering in Text-to-Image Models

Diffusion-based text-to-image (T2I) models generate high-fidelity images for given textual prompts. They are trained on large datasets scraped from the Internet, which may contain unacceptable concepts (e.g., copyright-infringing or unsafe content). Retraining T2I models after filtering unacceptable concepts out of the training data is inefficient and degrades utility. Hence, there is a need for concept removal techniques (CRTs) that are effective in removing unacceptable concepts, utility-preserving on acceptable concepts, and robust against evasion with adversarial prompts. None of the prior filtering and fine-tuning CRTs satisfy all these requirements simultaneously. We introduce Espresso, the first robust concept filter based on Contrastive Language-Image Pre-Training (CLIP). It identifies unacceptable concepts by projecting the generated image's embedding onto the vector connecting unacceptable and acceptable concepts in the joint text-image embedding space. This ensures robustness by restricting the adversary to adding noise only along this vector, in the direction of the acceptable concept. Further fine-tuning Espresso to separate the embeddings of acceptable and unacceptable concepts, while preserving their pairing with image embeddings, ensures both effectiveness and utility. We evaluate Espresso on eleven concepts to show that it is effective (~5% CLIP accuracy on unacceptable concepts), utility-preserving (~93% normalized CLIP score on acceptable concepts), and robust (~4% CLIP accuracy on adversarial prompts for unacceptable concepts). Finally, we present theoretical bounds for the certified robustness of Espresso against adversarial prompts, along with an empirical analysis.
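
The projection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 0.5 threshold, and the toy 3-dimensional vectors standing in for CLIP embeddings are all assumptions made for illustration. The sketch scores an image embedding by its position along the line from the unacceptable concept's embedding to the acceptable concept's embedding, and flags the image when it falls on the unacceptable side.

```python
import numpy as np


def normalize(v):
    """Scale a vector to unit length (CLIP embeddings are usually compared after normalization)."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)


def projection_score(image_emb, unacceptable_emb, acceptable_emb):
    """Position of the image embedding along the unacceptable -> acceptable line.

    Returns 0.0 when the image coincides with the unacceptable concept and
    1.0 when it coincides with the acceptable concept.
    """
    x = normalize(image_emb)
    u = normalize(unacceptable_emb)
    a = normalize(acceptable_emb)
    direction = a - u  # vector connecting the two concepts
    return float(np.dot(x - u, direction) / np.dot(direction, direction))


def is_unacceptable(image_emb, unacceptable_emb, acceptable_emb, threshold=0.5):
    """Flag the image if its projection lies closer to the unacceptable concept.

    The midpoint threshold is a hypothetical choice, not a value from the paper.
    """
    return projection_score(image_emb, unacceptable_emb, acceptable_emb) < threshold


# Toy stand-ins for text embeddings of the two concepts (assumed values).
unacceptable = np.array([1.0, 0.0, 0.0])
acceptable = np.array([0.0, 1.0, 0.0])

# Toy stand-ins for embeddings of two generated images (assumed values).
image_near_unacceptable = np.array([0.9, 0.1, 0.2])
image_near_acceptable = np.array([0.1, 0.9, 0.2])

print(is_unacceptable(image_near_unacceptable, unacceptable, acceptable))
print(is_unacceptable(image_near_acceptable, unacceptable, acceptable))
```

In practice the three embeddings would come from a CLIP image encoder and text encoder rather than hand-written vectors, and the threshold would need to be chosen empirically.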
