Text-to-Image (T2I) models have achieved remarkable success in image generation and editing, yet they still have many potential issues, particularly the generation of inappropriate or Not-Safe-For-Work (NSFW) content. Strengthening attacks to uncover such vulnerabilities can advance the development of reliable and practical T2I models. Most previous works treat T2I models as white-box systems and use gradient optimization to generate adversarial prompts. However, accessing the model's gradients is often impossible in real-world scenarios. Moreover, existing defense methods, such as those using gradient masking, are designed to prevent attackers from obtaining accurate gradient information. While several black-box jailbreak attacks have been explored, they achieve limited success in jailbreaking T2I models because of the difficulty of optimizing in discrete spaces. To address this, we propose HTS-Attack, a heuristic token search attack method. HTS-Attack begins with an initialization that removes sensitive tokens, followed by a heuristic search in which high-performing candidates are recombined and mutated. This process generates a new pool of candidates, and the optimal adversarial prompt is updated based on their effectiveness. By incorporating both optimal and suboptimal candidates, HTS-Attack avoids local optima and improves robustness in bypassing defenses. Extensive experiments validate the effectiveness of our method in attacking the latest prompt checkers, post-hoc image checkers, securely trained T2I models, and online commercial models.
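The search loop described in the abstract (filter the seed tokens, then repeatedly recombine and mutate the better candidates while keeping some suboptimal ones) has the general shape of an evolutionary search over token sequences. The following is a minimal sketch of that generic shape only, not the authors' implementation. Every name in it (`strip_flagged`, `recombine`, `mutate`, `heuristic_token_search`), the single-point crossover, the mutation rate, and the pool sizes are assumptions made for illustration, and the scoring function is a harmless toy stand-in for whatever black-box feedback signal a real evaluation would use.

```python
import random
from typing import Callable, List, Sequence, Set, Tuple

Tokens = List[str]


def strip_flagged(tokens: Sequence[str], flagged: Set[str]) -> Tokens:
    """Initialization step: drop tokens that appear in a (hypothetical) flagged set."""
    return [t for t in tokens if t.lower() not in flagged]


def recombine(a: Sequence[str], b: Sequence[str], rng: random.Random) -> Tokens:
    """Single-point crossover (an assumed choice): prefix of `a` plus suffix of `b`."""
    shortest = min(len(a), len(b))
    if shortest < 2:
        return list(a)
    cut = rng.randint(1, shortest - 1)
    return list(a[:cut]) + list(b[cut:])


def mutate(tokens: Sequence[str], vocab: Sequence[str],
           rng: random.Random, rate: float = 0.2) -> Tokens:
    """Replace each token with a random vocabulary token with probability `rate`."""
    return [rng.choice(vocab) if rng.random() < rate else t for t in tokens]


def heuristic_token_search(seed: Sequence[str], flagged: Set[str],
                           vocab: Sequence[str],
                           score: Callable[[Tokens], float],
                           pool_size: int = 8, keep: int = 4,
                           steps: int = 30, rng_seed: int = 0
                           ) -> Tuple[Tokens, float]:
    """Generic recombine-and-mutate search; `score` is a caller-supplied black box."""
    rng = random.Random(rng_seed)
    start = strip_flagged(seed, flagged)
    pool = [start] + [mutate(start, vocab, rng) for _ in range(pool_size - 1)]
    for _ in range(steps):
        # Keep the top `keep` candidates, so suboptimal ones also act as parents.
        parents = sorted(pool, key=score, reverse=True)[:keep]
        children = [
            mutate(recombine(rng.choice(parents), rng.choice(parents), rng),
                   vocab, rng)
            for _ in range(pool_size - keep)
        ]
        pool = parents + children
    best = max(pool, key=score)
    return best, score(best)


# Toy usage with a harmless objective: count how many target words appear.
seed_tokens = "a quiet foo harbor at dawn".split()
flagged_set = {"foo"}                      # hypothetical flagged token
vocabulary = ["misty", "calm", "golden", "harbor", "dawn", "boats"]
targets = {"golden", "boats"}


def toy_score(tokens: Tokens) -> float:
    return float(len(set(tokens) & targets))


start_tokens = strip_flagged(seed_tokens, flagged_set)
best, best_score = heuristic_token_search(seed_tokens, flagged_set,
                                          vocabulary, toy_score)
```

Because the parents are carried over into each new pool, the best score never decreases between iterations, while the lower-ranked parents keep some diversity in the pool. How candidates are actually selected, scored, and updated in HTS-Attack is specified in the paper itself, not in this sketch.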