Espresso: Robust Concept Filtering in Text-to-Image Models

Diffusion-based text-to-image (T2I) models generate high-fidelity images for given textual prompts. They are trained on large datasets scraped from the Internet, which may contain unacceptable concepts (e.g., copyright-infringing or unsafe content). Retraining T2I models after filtering unacceptable concepts out of the training data is inefficient and degrades utility. Hence, there is a need for concept removal techniques (CRTs) that are effective in removing unacceptable concepts, utility-preserving on acceptable concepts, and robust against evasion with adversarial prompts. None of the prior filtering and fine-tuning CRTs satisfies all of these requirements simultaneously. We introduce Espresso, the first robust concept filter based on Contrastive Language-Image Pre-Training (CLIP). It identifies unacceptable concepts by projecting the generated image's embedding onto the vector connecting unacceptable and acceptable concepts in the joint text-image embedding space. This ensures robustness by restricting the adversary to adding noise only along this vector, in the direction of the acceptable concept. Further fine-tuning Espresso to separate the embeddings of acceptable and unacceptable concepts, while preserving their pairing with image embeddings, ensures both effectiveness and utility. We evaluate Espresso on eleven concepts and show that it is effective (~5% CLIP accuracy on unacceptable concepts), utility-preserving (~93% normalized CLIP score on acceptable concepts), and robust (~4% CLIP accuracy on adversarial prompts for unacceptable concepts). Finally, we present theoretical bounds for the certified robustness of Espresso against adversarial prompts, together with an empirical analysis.
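
The projection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy 3-dimensional vectors, and the decision threshold of 0 are all assumptions made for illustration, and in practice the embeddings would come from a CLIP image encoder and text encoder. With unit-normalized embeddings, a positive projection onto the vector from the acceptable to the unacceptable concept is equivalent to the image being more cosine-similar to the unacceptable concept.

```python
import numpy as np


def _unit(v):
    """Return v scaled to unit L2 norm."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)


def projection_score(image_emb, unacceptable_emb, acceptable_emb):
    """Signed projection of the image embedding onto the unit vector that
    points from the acceptable concept to the unacceptable concept.

    All three inputs are assumed to live in the same joint embedding space.
    """
    img = _unit(image_emb)
    c_u = _unit(unacceptable_emb)
    c_a = _unit(acceptable_emb)
    direction = _unit(c_u - c_a)  # vector connecting the two concepts
    return float(img @ direction)


def is_unacceptable(image_emb, unacceptable_emb, acceptable_emb, threshold=0.0):
    """Flag the image when its projection falls on the unacceptable side.

    The threshold of 0.0 is a hypothetical default, not a value from the paper.
    """
    return projection_score(image_emb, unacceptable_emb, acceptable_emb) > threshold


# Toy embeddings (hypothetical stand-ins for CLIP outputs).
c_unacceptable = np.array([1.0, 0.0, 0.0])
c_acceptable = np.array([0.0, 1.0, 0.0])
img_bad = np.array([0.9, 0.1, 0.2])   # closer to the unacceptable concept
img_good = np.array([0.1, 0.9, 0.2])  # closer to the acceptable concept

flag_bad = is_unacceptable(img_bad, c_unacceptable, c_acceptable)
flag_good = is_unacceptable(img_good, c_unacceptable, c_acceptable)

# Noise orthogonal to the connecting vector does not change the decision.
img_bad_noisy = img_bad + np.array([0.0, 0.0, 5.0])
flag_bad_noisy = is_unacceptable(img_bad_noisy, c_unacceptable, c_acceptable)
```

In this toy setup, only a perturbation with a component along the connecting vector, pointing toward the acceptable concept, can move `img_bad` across the threshold, which is the intuition behind the robustness claim.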
