Text-to-image (T2I) synthesis has advanced significantly in recent years, particularly with the emergence of Large Language Models (LLMs) and their use in enhancing Large Vision Models (LVMs), which has greatly improved the instruction-following capabilities of traditional T2I models. However, previous prompt-optimization methods focus on improving generation quality and, in doing so, introduce unsafe factors into prompts. We find that appending specific camera descriptions to prompts can improve safety. We therefore propose a simple and safe prompt engineering method (SSP) that improves image generation quality by providing optimal camera descriptions. Specifically, we build a dataset of original prompts drawn from multiple existing datasets. To select the optimal camera, we design an optimal camera matching approach and implement a classifier that automatically matches each original prompt to a camera description. Appending the matched camera description to the original prompt yields an optimized prompt, which is then passed to an LVM for image generation. Experiments show that SSP improves semantic consistency by 16% on average over other methods and improves safety metrics by 48.9%.
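The match-then-append step described above can be illustrated with a minimal sketch. It is not the SSP implementation: the camera descriptions, the category names, and the keyword lists are all hypothetical, and a simple keyword count stands in for the classifier, whose architecture and training are not specified here.

```python
# Hypothetical camera descriptions; the abstract does not list the real ones.
CAMERA_DESCRIPTIONS = {
    "portrait": "shot with an 85mm lens, shallow depth of field",
    "landscape": "shot with a 16mm wide-angle lens, deep focus",
    "default": "shot with a 50mm lens, natural lighting",
}

# Hypothetical keyword lists standing in for the learned classifier.
KEYWORDS = {
    "portrait": ("person", "face", "woman", "man", "child"),
    "landscape": ("mountain", "forest", "beach", "city", "sky"),
}


def classify_prompt(prompt: str) -> str:
    """Return the camera category whose keywords best match the prompt."""
    words = set(prompt.lower().replace(",", " ").replace(".", " ").split())
    best_label, best_hits = "default", 0
    for label, keywords in KEYWORDS.items():
        hits = sum(1 for keyword in keywords if keyword in words)
        if hits > best_hits:
            best_label, best_hits = label, hits
    return best_label


def optimize_prompt(prompt: str) -> str:
    """Append the matched camera description to the original prompt."""
    label = classify_prompt(prompt)
    return f"{prompt.rstrip(' .,')}, {CAMERA_DESCRIPTIONS[label]}"


if __name__ == "__main__":
    # The optimized prompt would then be passed to an image generation model.
    print(optimize_prompt("A woman reading a book."))
```

Only a suffix is appended, so the original prompt text is left unchanged, which matches the abstract's description of adding camera descriptions rather than rewriting prompts.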