Large vision-language models (LVLMs) have demonstrated impressive capabilities in image understanding and response generation. However, this rich visual interaction also makes LVLMs vulnerable to adversarial examples. In this paper, we formulate a novel and practical gray-box attack scenario in which the adversary can only access the visual encoder of the victim LVLM, without knowledge of its prompts (which are often proprietary to service providers and not publicly available) or its underlying large language model (LLM). This practical setting poses challenges to the cross-prompt and cross-model transferability of targeted adversarial attacks, which aim to mislead the LVLM into outputting a response that is semantically similar to the attacker's chosen target text. To this end, we propose an instruction-tuned targeted attack (dubbed InstructTA) that delivers targeted adversarial attacks on LVLMs with high transferability. First, we use a public text-to-image generative model to "reverse" the target response into a target image, and employ GPT-4 to infer a reasonable instruction $\boldsymbol{p}^\prime$ from the target response. We then form a local surrogate model (which shares its visual encoder with the victim LVLM) to extract instruction-aware features of the adversarial example and the target image, and minimize the distance between these two feature representations to optimize the adversarial example. To further improve transferability, we augment the instruction $\boldsymbol{p}^\prime$ with instructions paraphrased by an LLM. Extensive experiments demonstrate the superiority of our proposed method in targeted attack performance and transferability.
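To make the optimization step concrete, the following is a minimal sketch of the feature-matching objective described above: a PGD-style loop that perturbs the adversarial image so that its instruction-aware features, under the surrogate model, approach those of the generated target image, averaged over a pool of paraphrased instructions. The `surrogate(image, instruction)` callable, the `instructions` list, and the hyperparameters (`eps`, `alpha`, `steps`, an $L_\infty$ budget) are illustrative assumptions, not the authors' released implementation.

```python
import torch

def instructta_attack(surrogate, clean_image, target_image, instructions,
                      eps=8 / 255, alpha=1 / 255, steps=100):
    """Sketch of the InstructTA objective: minimize the expected distance
    between instruction-aware features of the adversarial image and the
    target image over LLM-paraphrased instructions.

    Assumptions (hypothetical, for illustration only):
      - surrogate(image, instruction) returns a feature tensor,
      - images are float tensors in [0, 1],
      - an L_inf perturbation budget with a sign-gradient update.
    """
    adv = clean_image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = torch.zeros((), device=adv.device)
        for p in instructions:  # augmentation with paraphrased instructions
            f_adv = surrogate(adv, p)
            with torch.no_grad():  # target features need no gradient
                f_tgt = surrogate(target_image, p)
            loss = loss + (f_adv - f_tgt).norm()
        loss = loss / len(instructions)
        (grad,) = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv = adv - alpha * grad.sign()  # descend on feature distance
            # project back into the L_inf ball and the valid pixel range
            adv = clean_image + (adv - clean_image).clamp(-eps, eps)
            adv = adv.clamp(0.0, 1.0)
        adv = adv.detach()
    return adv
```

Averaging the loss over several paraphrased instructions, rather than optimizing against a single $\boldsymbol{p}^\prime$, is what the abstract credits for the improved cross-prompt transferability: the perturbation cannot overfit to one specific instruction's feature conditioning.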