State-of-the-art Diffusion Models (DMs) produce highly realistic images. While prior work has successfully mitigated Not Safe For Work (NSFW) content in the visual domain, we identify a novel threat: the generation of NSFW text embedded within images. This includes offensive language, such as insults, racial slurs, and sexually explicit terms, posing significant risks to users. We show that all state-of-the-art DMs (e.g., SD3, SDXL, Flux, DeepFloyd IF) are vulnerable to this issue. Through extensive experiments, we demonstrate that existing mitigation techniques, though effective for visual content, fail to prevent harmful text generation and substantially degrade benign text generation. As an initial step toward addressing this threat, we introduce a novel fine-tuning strategy that targets only the text-generation layers in DMs. To this end, we construct a safety fine-tuning dataset by pairing each NSFW prompt with two images: one containing the NSFW term, and another in which that term is replaced with a carefully crafted benign alternative while the rest of the image is left unchanged. By training on this dataset, the model learns to avoid generating harmful text while preserving benign content and overall image quality. Finally, to advance research in the area, we release ToxicBench, an open-source benchmark for evaluating NSFW text generation in images. It includes our curated fine-tuning dataset, a set of harmful prompts, new evaluation metrics, and a pipeline that assesses NSFW-ness along with text and image quality. Our benchmark aims to guide future efforts in mitigating NSFW text generation in text-to-image models, thereby contributing to their safe deployment.
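A minimal sketch of the targeted fine-tuning idea described above, assuming a diffusers-style Stable Diffusion UNet. Here the cross-attention blocks (named "attn2" in diffusers) stand in for the text-generation layers; the paper's actual layer selection and checkpoint are assumptions, not specified by this abstract.

```python
import torch
from diffusers import UNet2DConditionModel

# Load the denoising UNet of a Stable Diffusion checkpoint
# (checkpoint choice is illustrative).
unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-2-1", subfolder="unet"
)

# Freeze all parameters, then unfreeze only the cross-attention
# ("attn2") blocks, used here as a proxy for the layers responsible
# for rendering text in the image.
for name, param in unet.named_parameters():
    param.requires_grad = "attn2" in name

trainable = sum(p.numel() for p in unet.parameters() if p.requires_grad)
total = sum(p.numel() for p in unet.parameters())
print(f"trainable: {trainable:,} / {total:,} parameters")
```

Training would then proceed on the paired dataset, with the NSFW-prompted example supervised toward the benign-text image, so that only the unfrozen layers are updated.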