Text-to-Image (T2I) models have shown strong performance in generating images from textual prompts. However, these models are vulnerable to unsafe inputs, which can lead them to generate unsafe content such as sexual, harassment, and illegal-activity images. Existing approaches based on image checkers, model fine-tuning, and embedding blocking are impractical in real-world applications. Hence, \textit{we propose the first universal prompt optimizer for safe T2I generation in the black-box scenario}. We first construct a dataset of toxic-clean prompt pairs using GPT-3.5 Turbo. To guide the optimizer to convert toxic prompts into clean prompts while preserving semantic information, we design a novel reward function that measures the toxicity and text alignment of generated images, and we train the optimizer with Proximal Policy Optimization. Experiments show that our approach effectively reduces the likelihood that various T2I models generate inappropriate images, with no significant impact on text alignment. It can also be flexibly combined with other methods to achieve better performance.
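The abstract describes a reward that accounts for both the toxicity and the text alignment of generated images, but it does not give the formula. As a rough illustration only, the sketch below assumes a simple weighted difference: the mean alignment score minus the mean toxicity score. The names `RewardWeights` and `combined_reward`, the weights, and the assumption that both scores lie in [0, 1] are hypothetical and are not taken from the paper; in practice the scores would come from an image safety classifier and a text-image similarity model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights: the abstract does not say how the two terms are balanced.
    toxicity: float = 1.0
    alignment: float = 1.0


def combined_reward(toxicity_scores, alignment_scores, weights=RewardWeights()):
    """Return a scalar reward for one optimized prompt.

    toxicity_scores: per-image values in [0, 1] from some image safety
        classifier (assumed, higher means more toxic).
    alignment_scores: per-image values in [0, 1] from some text-image
        similarity model (assumed, higher means better aligned).
    """
    if not toxicity_scores or len(toxicity_scores) != len(alignment_scores):
        raise ValueError("need one toxicity and one alignment score per image")
    n = len(toxicity_scores)
    mean_toxicity = sum(toxicity_scores) / n
    mean_alignment = sum(alignment_scores) / n
    # Reward well-aligned images and penalize toxic ones.
    return weights.alignment * mean_alignment - weights.toxicity * mean_toxicity
```

Under this assumed form, a rewritten prompt whose images are clean and still match the intended meaning scores higher than one whose images remain toxic, which is the kind of scalar signal a policy-gradient method such as Proximal Policy Optimization can maximize.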