State-of-the-art Text-to-Image models like Stable Diffusion and DALL$\cdot$E 2 are revolutionizing how people generate visual content. At the same time, society has serious concerns about how adversaries can exploit such models to generate unsafe images. In this work, we focus on demystifying the generation of unsafe images and hateful memes from Text-to-Image models. We first construct a typology of unsafe images consisting of five categories (sexually explicit, violent, disturbing, hateful, and political). Then, we assess the proportion of unsafe images generated by four advanced Text-to-Image models using four prompt datasets. We find that these models can generate a substantial percentage of unsafe images; across the four models and four prompt datasets, 14.56% of all generated images are unsafe. The four models carry different levels of risk, with Stable Diffusion being the most prone to generating unsafe content (18.92% of all its generated images are unsafe). Given this tendency, we evaluate Stable Diffusion's potential to generate hateful meme variants if exploited by an adversary to attack a specific individual or community. We employ three image editing methods supported by Stable Diffusion: DreamBooth, Textual Inversion, and SDEdit. Our evaluation results show that 24% of the images generated using DreamBooth are hateful meme variants that present the features of both the original hateful meme and the target individual/community; these generated images are comparable to hateful meme variants collected from the real world. Overall, our results demonstrate that the danger of large-scale generation of unsafe images is imminent. We discuss several mitigating measures, such as curating training data, regulating prompts, and implementing safety filters, and encourage the development of better safeguard tools to prevent unsafe generation.