HTS-Attack: Heuristic Token Search for Jailbreaking Text-to-Image Models

Text-to-Image (T2I) models have achieved remarkable success in image generation and editing, yet they still have many potential issues, particularly the generation of inappropriate or Not-Safe-For-Work (NSFW) content. Strengthening attacks and uncovering such vulnerabilities can advance the development of reliable and practical T2I models. Most previous works treat T2I models as white-box systems and use gradient optimization to generate adversarial prompts. However, accessing the model's gradients is often impossible in real-world scenarios. Moreover, existing defense methods, such as those using gradient masking, are designed to prevent attackers from obtaining accurate gradient information. While several black-box jailbreak attacks have been explored, they achieve limited performance in jailbreaking T2I models because of the difficulty of optimization in discrete spaces. To address this, we propose HTS-Attack, a heuristic token search attack method. HTS-Attack begins with an initialization step that removes sensitive tokens, followed by a heuristic search in which high-performing candidates are recombined and mutated. This process generates a new pool of candidates, and the optimal adversarial prompt is updated based on their effectiveness. By incorporating both optimal and suboptimal candidates, HTS-Attack avoids local optima and improves robustness in bypassing defenses. Extensive experiments validate the effectiveness of our method in attacking the latest prompt checkers, post-hoc image checkers, securely trained T2I models, and online commercial models.
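The search loop described in the abstract (filter the starting tokens, then repeatedly recombine and mutate the best candidates) has the shape of a generic evolutionary search over token sequences. The following is a minimal sketch of that general pattern only, not the authors' implementation: the abstract does not specify the operators, pool sizes, or scoring, so every function name, parameter, and default here is hypothetical, and the score is a harmless toy stand-in.

```python
import random

def remove_flagged(tokens, flagged):
    """Initialization: drop tokens that appear in a flagged-word set."""
    return [t for t in tokens if t not in flagged]

def recombine(a, b, rng):
    """Single-point crossover between two equal-length token lists."""
    if len(a) < 2:
        return list(a)
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(tokens, vocab, rng, rate=0.2):
    """Replace each token by a random vocabulary token with probability `rate`."""
    return [rng.choice(vocab) if rng.random() < rate else t for t in tokens]

def heuristic_token_search(seed, vocab, score, flagged,
                           pool_size=20, keep=5, steps=30, rng=None):
    """Generic evolutionary token search (hypothetical parameters throughout)."""
    rng = rng or random.Random(0)
    start = remove_flagged(seed, flagged)
    # Initial pool: the filtered seed plus mutated copies of it.
    pool = [start] + [mutate(start, vocab, rng) for _ in range(pool_size - 1)]
    best = max(pool, key=score)
    for _ in range(steps):
        # Keep the top `keep` candidates, i.e. the best one and several
        # suboptimal ones, so the search is not tied to a single parent.
        parents = sorted(pool, key=score, reverse=True)[:keep]
        children = []
        while len(children) < pool_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(recombine(a, b, rng), vocab, rng))
        pool = children
        top = max(pool, key=score)
        if score(top) > score(best):  # update the running best only on improvement
            best = top
    return best

# Toy demonstration with a stand-in objective: match a fixed target phrase.
vocab = ["a", "quiet", "harbor", "at", "dawn", "mist", "boats"]
target = ["a", "quiet", "harbor", "at", "dawn"]

def toy_score(tokens):
    """Number of positions that agree with the toy target."""
    return sum(t == g for t, g in zip(tokens, target))

seed = ["a", "BLOCKED", "loud", "harbor", "at", "noon"]
result = heuristic_token_search(seed, vocab, toy_score, flagged={"BLOCKED"})
print(result, toy_score(result))
```

Carrying several parents forward, rather than only the top one, is what the abstract credits with avoiding local optima; in this sketch that corresponds to `keep` being greater than 1.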

