Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models

State-of-the-art Text-to-Image models like Stable Diffusion and DALL·E 2 are revolutionizing how people generate visual content. At the same time, society has serious concerns about how adversaries can exploit such models to generate unsafe images. In this work, we focus on demystifying the generation of unsafe images and hateful memes from Text-to-Image models. We first construct a typology of unsafe images consisting of five categories (sexually explicit, violent, disturbing, hateful, and political). Then, we assess the proportion of unsafe images generated by four advanced Text-to-Image models using four prompt datasets. We find that these models can generate a substantial percentage of unsafe images; across four models and four prompt datasets, 14.56% of all generated images are unsafe. When comparing the four models, we find different risk levels, with Stable Diffusion being the most prone to generating unsafe content (18.92% of all generated images are unsafe). Given Stable Diffusion's tendency to generate more unsafe content, we evaluate its potential to generate hateful meme variants if exploited by an adversary to attack a specific individual or community. We employ three image editing methods supported by Stable Diffusion: DreamBooth, Textual Inversion, and SDEdit. Our evaluation results show that 24% of the images generated using DreamBooth are hateful meme variants that present the features of both the original hateful meme and the target individual/community; these generated images are comparable to hateful meme variants collected from the real world. Overall, our results demonstrate that the danger of large-scale generation of unsafe images is imminent. We discuss several mitigating measures, such as curating training data, regulating prompts, and implementing safety filters, and encourage the development of better safeguard tools to prevent unsafe generation.
