With the prevalence of text-to-image generative models, their safety has become a critical concern. Adversarial testing techniques have been developed to probe whether such models can be prompted to produce Not-Safe-For-Work (NSFW) content. However, existing solutions face several challenges, including low success rates and inefficiency. We introduce Groot, the first automated framework leveraging tree-based semantic transformation for adversarial testing of text-to-image models. Groot employs semantic decomposition and sensitive element drowning strategies in conjunction with large language models (LLMs) to systematically refine adversarial prompts. Our comprehensive evaluation confirms the efficacy of Groot, which not only exceeds the performance of current state-of-the-art approaches but also achieves a success rate of 93.66% on leading text-to-image models such as DALL-E 3 and Midjourney.