Wukong Framework for Not Safe For Work Detection in Text-to-Image Systems

Text-to-Image (T2I) generation is a popular AI-generated content (AIGC) technology that enables diverse and creative image synthesis. However, some outputs may contain Not Safe For Work (NSFW) content (e.g., violence) that violates community guidelines. Detecting such content efficiently and accurately, a task known as external safeguarding, is therefore essential. Existing external safeguards fall into two types: text filters, which analyze user prompts but overlook T2I model-specific variations and are prone to adversarial attacks; and image filters, which analyze the final generated images but are computationally costly and introduce latency. Diffusion models, the foundation of modern T2I systems such as Stable Diffusion, generate images through iterative denoising using a U-Net architecture built from ResNet and Transformer blocks. We observe that (1) early denoising steps define the semantic layout of the image, and (2) the cross-attention layers in the U-Net are crucial for aligning text with image regions. Based on these insights, we propose Wukong, a transformer-based NSFW detection framework that leverages intermediate outputs from early denoising steps and reuses the U-Net's pre-trained cross-attention parameters. Wukong operates within the diffusion process, enabling early detection without waiting for the full image to be generated. We also introduce a new dataset containing prompts, seeds, and image-specific NSFW labels, and evaluate Wukong on this dataset and on two public benchmarks. Results show that Wukong significantly outperforms text-based safeguards and achieves accuracy comparable to that of image filters, while offering much greater efficiency.
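The idea of detecting inside the diffusion process can be illustrated with a minimal, dependency-free sketch. It is not the Wukong implementation, because the abstract does not specify the detector's architecture. All names here (`cross_attention`, `toy_detector`, `generate_with_early_check`, `check_step`, `unsafe_direction`) are hypothetical, the denoiser is a stand-in function, and the learned query/key/value projections that a real U-Net would supply are omitted. The sketch shows only the control flow: run a few early denoising steps, let intermediate latents attend to the text embedding, and stop generation if the detector flags the result.

```python
import math
from typing import Callable, List, Optional, Tuple

Vector = List[float]
Matrix = List[Vector]


def softmax(xs: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Scaled dot-product attention: latent (image) queries attend to text tokens.

    Assumption: learned projection matrices are omitted for brevity.
    """
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def toy_detector(latent: Matrix, text_emb: Matrix,
                 unsafe_direction: Vector, threshold: float = 0.5) -> bool:
    """Hypothetical detector head: mean-pool attended features and threshold
    their projection onto a made-up 'unsafe' direction."""
    attended = cross_attention(latent, text_emb, text_emb)
    pooled = [sum(col) / len(attended) for col in zip(*attended)]
    score = sum(p * u for p, u in zip(pooled, unsafe_direction))
    return score > threshold


def generate_with_early_check(
    denoise_step: Callable[[Matrix, Matrix, int], Matrix],
    detector: Callable[[Matrix, Matrix], bool],
    latent: Matrix,
    text_emb: Matrix,
    total_steps: int = 50,
    check_step: int = 5,
) -> Tuple[Optional[Matrix], int]:
    """Run the denoising loop, checking once after `check_step` steps.

    Returns (final latent, steps run), or (None, steps run) if generation
    was stopped early because the detector flagged the intermediate latent.
    """
    for t in range(total_steps):
        latent = denoise_step(latent, text_emb, t)
        if t + 1 == check_step and detector(latent, text_emb):
            return None, t + 1
    return latent, total_steps


# Usage with a stand-in denoiser and two toy "prompts".
fake_step = lambda z, c, t: [[0.9 * x for x in row] for row in z]
detector = lambda z, c: toy_detector(z, c, unsafe_direction=[1.0, 0.0])
start = [[1.0, 0.0], [0.0, 1.0]]

blocked, steps_blocked = generate_with_early_check(fake_step, detector, start, [[1.0, 0.0]])
allowed, steps_allowed = generate_with_early_check(fake_step, detector, start, [[0.0, 1.0]])
```

In this toy setup the flagged prompt stops after 5 of the 50 steps, which is the source of the efficiency gain over an image filter that must wait for the final image. The choice of `check_step` and the detector itself are placeholders for whatever the trained framework uses.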
