Espresso: Robust Concept Filtering in Text-to-Image Models

Diffusion-based text-to-image (T2I) models generate high-fidelity images for given textual prompts. They are trained on large datasets scraped from the Internet, which may contain unacceptable concepts (e.g., copyright-infringing or unsafe content). Retraining T2I models after filtering unacceptable concepts out of the training data is inefficient and degrades utility. Hence, there is a need for concept removal techniques (CRTs) that are effective in removing unacceptable concepts, utility-preserving on acceptable concepts, and robust against evasion with adversarial prompts. No prior filtering or fine-tuning CRT satisfies all of these requirements simultaneously. We introduce Espresso, the first robust concept filter based on Contrastive Language-Image Pre-Training (CLIP). It identifies unacceptable concepts by projecting the generated image's embedding onto the vector connecting the unacceptable and acceptable concepts in the joint text-image embedding space. This ensures robustness by restricting the adversary to adding noise only along this vector, in the direction of the acceptable concept. Further fine-tuning Espresso to separate the embeddings of acceptable and unacceptable concepts, while preserving their pairing with image embeddings, ensures both effectiveness and utility. We evaluate Espresso on eleven concepts and show that it is effective (~5% CLIP accuracy on unacceptable concepts), utility-preserving (~93% normalized CLIP score on acceptable concepts), and robust (~4% CLIP accuracy on adversarial prompts for unacceptable concepts). Finally, we present theoretical bounds for the certified robustness of Espresso against adversarial prompts, along with an empirical analysis.
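To make the projection step concrete, the following is a minimal sketch of a projection-based filter, not the authors' implementation. It assumes the image and the two concept texts have already been embedded by a CLIP-style encoder; the toy three-dimensional vectors stand in for real embeddings. The function names, the midpoint centering, and the default threshold of 0.0 are all illustrative assumptions that the abstract does not specify.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def projection_score(image_emb, unacceptable_emb, acceptable_emb):
    """Signed position of the image embedding along the axis that runs
    from the acceptable concept to the unacceptable concept.

    Assumption: the score is measured from the midpoint of the two
    concept embeddings, so positive means "closer to unacceptable".
    """
    img = normalize(image_emb)
    c_u = normalize(unacceptable_emb)
    c_a = normalize(acceptable_emb)
    # Unit vector connecting the acceptable concept to the unacceptable one.
    direction = normalize([u - a for u, a in zip(c_u, c_a)])
    midpoint = [(u + a) / 2.0 for u, a in zip(c_u, c_a)]
    offset = [i - m for i, m in zip(img, midpoint)]
    return dot(offset, direction)


def is_unacceptable(image_emb, unacceptable_emb, acceptable_emb, threshold=0.0):
    """Flag the image when its projection exceeds the (hypothetical) threshold."""
    return projection_score(image_emb, unacceptable_emb, acceptable_emb) > threshold


# Toy stand-ins for CLIP embeddings (hypothetical values).
unacceptable = [1.0, 0.0, 0.0]
acceptable = [0.0, 1.0, 0.0]

flag_near_unacceptable = is_unacceptable([0.9, 0.1, 0.2], unacceptable, acceptable)
flag_near_acceptable = is_unacceptable([0.1, 0.9, 0.2], unacceptable, acceptable)
```

With unit-normalized embeddings and a threshold of 0.0, the sign of this score matches the sign of the difference between the image's cosine similarity to the unacceptable concept and its cosine similarity to the acceptable concept, so the sketch reduces to comparing the two similarities. Only the component of the image embedding along the connecting vector enters the score, which is the property the abstract relies on for robustness.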
