Text-to-image (T2I) models have raised growing safety concerns due to their capacity to generate NSFW content and other prohibited material. To mitigate these risks, safety filters and concept removal techniques have been introduced to block inappropriate prompts or erase sensitive concepts from the models. However, existing defenses remain ill-equipped to handle diverse adversarial prompts. In this work, we introduce MacPrompt, a novel black-box, cross-lingual attack that reveals previously overlooked vulnerabilities in T2I safety mechanisms. Unlike existing attacks that rely on synonym substitution or prompt obfuscation, MacPrompt constructs macaronic adversarial prompts by recombining harmful terms at the character level across languages, enabling fine-grained control over both semantics and appearance. With this design, MacPrompt crafts prompts that retain high semantic similarity to the original harmful inputs (up to 0.96) while bypassing major safety filters (up to 100%). More critically, it achieves attack success rates of up to 92% for sex-related content and 90% for violence, breaking even state-of-the-art concept removal defenses. These results underscore the pressing need to reassess the robustness of existing T2I safety mechanisms against linguistically diverse and fine-grained adversarial strategies.
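To make the idea of cross-lingual character-level recombination concrete, the following is a minimal illustrative sketch, not the paper's actual MacPrompt algorithm: it swaps selected Latin characters in a term for visually similar Cyrillic homoglyphs, so the prompt looks unchanged to a human while its token sequence differs. The homoglyph table and function name are hypothetical.

```python
# Hypothetical illustration of cross-lingual character-level recombination.
# Latin -> visually similar Cyrillic homoglyphs (assumed mapping, not from the paper).
HOMOGLYPHS = {
    "a": "а", "e": "е", "o": "о", "p": "р", "c": "с", "x": "х",
}

def macaronic_recombine(term: str, positions=None) -> str:
    """Replace characters at the given positions (or all replaceable ones)
    with cross-script homoglyphs, preserving visual appearance while
    changing the underlying character sequence."""
    out = []
    for i, ch in enumerate(term):
        if (positions is None or i in positions) and ch.lower() in HOMOGLYPHS:
            sub = HOMOGLYPHS[ch.lower()]
            out.append(sub.upper() if ch.isupper() else sub)
        else:
            out.append(ch)
    return "".join(out)

mixed = macaronic_recombine("explosion")
print(mixed)                  # renders like "explosion", but differs byte-wise
print(mixed == "explosion")   # False
```

A keyword-based safety filter matching the ASCII string would miss the recombined form, while a human (or a sufficiently robust text encoder) still reads the same word.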