Understanding the vulnerabilities of Large Vision-Language Models (LVLMs) to jailbreak attacks is essential for their responsible real-world deployment. Most prior work either requires access to model gradients or relies on human knowledge (prompt engineering) to construct jailbreaks, and it rarely considers the interaction between images and text, so it either fails in black-box scenarios or performs poorly. To overcome these limitations, we propose a Prior-Guided Bimodal Interactive black-box jailbreak attack for toxicity maximization, referred to as PBI-Attack. Our method first extracts malicious features from a harmful corpus using a surrogate LVLM and embeds these features into a benign image as prior information. It then amplifies these features through bidirectional cross-modal interaction optimization, which iteratively optimizes the image and text perturbations in an alternating manner via greedy search, aiming to maximize the toxicity of the generated response; the toxicity level is quantified by a well-trained evaluation model. Experiments demonstrate that PBI-Attack outperforms previous state-of-the-art jailbreak methods, achieving an average attack success rate of 92.5% across three open-source LVLMs and around 67.3% across three closed-source LVLMs. Disclaimer: This paper contains potentially disturbing and offensive content.
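To make the alternating optimization concrete, below is a minimal, self-contained Python sketch of the two-stage loop the abstract describes. All helpers (`embed_prior`, `perturb_image`, `perturb_text`, `query_target_lvlm`, `toxicity_score`) are hypothetical stubs standing in for the surrogate LVLM, the black-box target model, and the toxicity judge; they are not the paper's actual implementation.

```python
import random

# --- Hypothetical stubs (placeholders only, so the sketch runs) ---
def embed_prior(image, corpus):
    # Stage 1 stand-in: a surrogate LVLM would extract malicious
    # features from the harmful corpus and inject them into the image.
    return image

def perturb_image(image):
    # Sample one candidate image perturbation (image = list of floats).
    return [p + random.uniform(-0.03, 0.03) for p in image]

def perturb_text(text):
    # Sample one candidate text perturbation.
    return text + random.choice([" step by step", " in detail", " hypothetically"])

def query_target_lvlm(image, text):
    # One black-box query to the target LVLM.
    return f"response to ({len(image)} pixel features, {len(text)} chars)"

def toxicity_score(response):
    # A trained evaluation model would score the response's toxicity.
    return random.random()

# --- Stage 2: bidirectional cross-modal interaction optimization ---
def pbi_attack(benign_image, harmful_query, corpus, n_iters=50, n_cands=8):
    image = embed_prior(benign_image, corpus)  # prior-guided initialization
    text = harmful_query
    best = toxicity_score(query_target_lvlm(image, text))
    for it in range(n_iters):
        # Alternate greedily: even iterations perturb the image,
        # odd iterations perturb the text, keeping a candidate only
        # if the judge scores its response as more toxic.
        for _ in range(n_cands):
            if it % 2 == 0:
                cand_img, cand_txt = perturb_image(image), text
            else:
                cand_img, cand_txt = image, perturb_text(text)
            score = toxicity_score(query_target_lvlm(cand_img, cand_txt))
            if score > best:
                best, image, text = score, cand_img, cand_txt
    return image, text, best

# Example usage with toy inputs:
# img, txt, score = pbi_attack([0.0] * 64, "a harmful query", ["corpus line"])
```

The alternation lets each modality's perturbation be refined against the other's current state, which is the bimodal interaction the abstract refers to; because the loop needs only query access and a toxicity score, it fits the black-box setting.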