Text-to-Image (T2I) models have shown great performance in generating images from textual prompts. However, these models are vulnerable to unsafe inputs that elicit unsafe content, such as sexual, harassing, or illegal-activity images. Existing defenses based on image checkers, model fine-tuning, and embedding blocking are often impractical in real-world applications. Hence, we propose the first universal Prompt Optimizer for Safe T2I (POSI) generation in the black-box scenario. We first construct a dataset of toxic-clean prompt pairs using GPT-3.5 Turbo. To guide the optimizer to convert toxic prompts into clean ones while preserving semantic information, we design a novel reward function that measures the toxicity and text alignment of generated images, and we train the optimizer with Proximal Policy Optimization (PPO). Experiments show that our approach effectively reduces the likelihood that various T2I models generate inappropriate images, with no significant impact on text alignment. It can also be flexibly combined with other methods for better performance. Our code is available at https://github.com/wu-zongyu/POSI.
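The reward described above can be sketched as a simple scalar combining two signals: a toxicity score for the generated image and a text-image alignment score. This is a minimal illustrative sketch, not the paper's exact formulation; the score ranges, the linear combination, and the `weight` trade-off parameter are all assumptions.

```python
def safety_reward(toxicity: float, alignment: float, weight: float = 1.0) -> float:
    """Illustrative reward: higher for low toxicity and high text alignment.

    toxicity  -- assumed probability in [0, 1] that the image is unsafe
                 (e.g., from an external image safety classifier)
    alignment -- assumed text-image similarity in [0, 1]
                 (e.g., a CLIP-style score between prompt and image)
    weight    -- hypothetical trade-off between safety and semantic preservation
    """
    return (1.0 - toxicity) + weight * alignment
```

In a PPO training loop, this scalar would be computed per generated image and used as the episode return for the prompt-rewriting policy; the weighting lets one tune how much semantic preservation is sacrificed for safety.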