Recent text-to-image (T2I) models have exhibited remarkable performance in generating high-quality images from text descriptions. However, these models are vulnerable to misuse, particularly for generating not-safe-for-work (NSFW) content, such as sexually explicit, violent, political, and disturbing images, raising serious ethical concerns. In this work, we present PromptGuard, a novel content moderation technique that draws inspiration from the system prompt mechanism used for safety alignment in large language models (LLMs). Unlike LLMs, T2I models lack a direct interface for enforcing behavioral guidelines. Our key idea is to optimize a safety soft prompt that functions as an implicit system prompt within the T2I model's textual embedding space. This universal soft prompt (P*) directly moderates NSFW inputs, enabling safe yet realistic image generation without compromising inference efficiency or requiring proxy models. We further enhance its reliability and helpfulness through a divide-and-conquer strategy that optimizes category-specific soft prompts and combines them into holistic safety guidance. Extensive experiments across five datasets demonstrate that PromptGuard effectively mitigates NSFW content generation while preserving high-quality benign outputs. PromptGuard runs 3.8 times faster than prior content moderation methods and surpasses eight state-of-the-art defenses, lowering the unsafe generation ratio to as little as 5.84%.
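To make the core mechanism concrete, the following is a minimal sketch of how a safety soft prompt could be applied at inference time: a learned embedding tensor is prepended to the user prompt's text embeddings before they condition the diffusion model, so it acts as an implicit system prompt that plain-text input cannot override. This sketch assumes Stable Diffusion v1.5 via the diffusers library; the tensor `safety_soft_prompt`, its length `n_tokens`, and the helper `encode_with_soft_prompt` are illustrative placeholders (a random tensor stands in where PromptGuard's optimized P* would be loaded), not the paper's actual implementation.

```python
# Sketch: prepend a safety soft prompt in the T2I model's textual
# embedding space, assuming Stable Diffusion v1.5 via diffusers.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Hypothetical placeholder for the optimized universal soft prompt P*;
# in practice this tensor would be loaded after optimization, not random.
n_tokens = 16
dim = pipe.text_encoder.config.hidden_size  # 768 for SD v1.5
safety_soft_prompt = torch.randn(1, n_tokens, dim, dtype=torch.float16, device="cuda")

def encode_with_soft_prompt(prompt: str) -> torch.Tensor:
    # Tokenize the user prompt, reserving n_tokens slots for the soft prompt
    # so the total sequence length matches the encoder's 77-token budget.
    tokens = pipe.tokenizer(
        prompt,
        padding="max_length",
        max_length=pipe.tokenizer.model_max_length - n_tokens,
        truncation=True,
        return_tensors="pt",
    ).input_ids.to("cuda")
    # Encode with the frozen CLIP text encoder.
    text_embeds = pipe.text_encoder(tokens)[0]
    # Prepend the soft prompt in embedding space: this is the implicit
    # system prompt that moderates NSFW inputs at no extra inference cost.
    return torch.cat([safety_soft_prompt, text_embeds], dim=1)

image = pipe(prompt_embeds=encode_with_soft_prompt("a scenic mountain lake")).images[0]
image.save("output.png")
```

Because the soft prompt is concatenated in embedding space rather than run through an auxiliary classifier, the generation pipeline is unchanged, which is consistent with the abstract's claim of preserving inference efficiency without proxy models.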