Large Vision-Language Models (LVLMs) are increasingly equipped with robust safety safeguards to prevent responses to harmful or disallowed prompts. However, these defenses often focus on analyzing explicit textual inputs or directly relevant visual scenes. In this work, we introduce Text-DJ, a novel jailbreak attack that bypasses these safeguards by exploiting the model's Optical Character Recognition (OCR) capability. Our methodology consists of three stages. First, we decompose a single harmful query into multiple semantically related but individually more benign sub-queries. Second, we select a set of distraction queries that are maximally irrelevant to the harmful query. Third, we present all decomposed sub-queries and distraction queries to the LVLM simultaneously as a grid of images, with the sub-queries placed in the middle of the grid. We demonstrate that this method successfully circumvents the safety alignment of state-of-the-art LVLMs. We argue the attack succeeds for two reasons: (1) converting text-based prompts into images bypasses standard text-based filters, and (2) the induced distraction prevents the model's safety protocols from linking the scattered sub-queries amid a large number of irrelevant queries. Overall, our findings expose a critical vulnerability in LVLMs: their OCR capabilities are not robust to dispersed, multi-image adversarial inputs, highlighting the need for defenses against fragmented multimodal inputs.
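The third stage's layout step can be illustrated with a small sketch. This is not the paper's implementation; the function name `build_query_grid` and the distance-based placement rule are assumptions used only to show how sub-queries could be assigned to the central cells of a grid before each cell is rendered as an image.

```python
import math

def build_query_grid(sub_queries, distractors):
    """Hypothetical sketch: place sub-queries at the centre of a square grid,
    filling the remaining cells with distraction queries."""
    total = len(sub_queries) + len(distractors)
    side = math.ceil(math.sqrt(total))  # smallest square grid that fits all queries
    centre = (side - 1) / 2
    # Order cells by squared distance from the grid centre, so the first
    # cells handed out are the most central ones.
    cells = sorted(
        ((r, c) for r in range(side) for c in range(side)),
        key=lambda rc: (rc[0] - centre) ** 2 + (rc[1] - centre) ** 2,
    )
    grid = [[None] * side for _ in range(side)]
    # Sub-queries come first in the sequence, so they occupy the central cells.
    for (r, c), text in zip(cells, sub_queries + distractors):
        grid[r][c] = text
    return grid
```

In a full attack pipeline, each cell's text would then be rendered to an image (e.g. with a drawing library) and the grid submitted as a multi-image input; the sketch above covers only the placement logic described in the abstract.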