Text-to-Image (T2I) generation has advanced rapidly in recent years, but it also raises safety concerns because of its potential to produce harmful content. In practical deployments, T2I services typically adopt full-chain defenses that combine a prompt checker, a securely trained generator, and a post-hoc image checker. Jailbreaking such full-chain systems is challenging in black-box settings because prompt tokens form a discrete combinatorial space and the attack must satisfy multiple coupled constraints under sparse feedback and a limited query budget. To address these challenges, we propose Token-level Constraint Boundary Search (TCBS)-Attack, a novel query-based black-box jailbreak attack that searches for tokens located near the decision boundaries defined by the text and image checkers. TCBS-Attack incorporates these decision boundaries as constraint conditions to guide an evolutionary search over token populations, iteratively optimizing tokens near the boundaries. This evolutionary search process reduces the effective search space and improves query efficiency while preserving semantic coherence. Extensive experiments demonstrate that TCBS-Attack consistently outperforms state-of-the-art jailbreak attacks across various T2I models, including securely trained open-source models and commercial online services such as DALL-E 3. TCBS-Attack achieves an ASR-4 of 52.5% and an ASR-1 of 22.0% when jailbreaking full-chain T2I models, significantly surpassing baseline methods.