Large vision-language models (LVLMs) have demonstrated strong capabilities in image understanding and response generation. However, this rich visual interaction also makes LVLMs vulnerable to adversarial examples. In this paper, we formulate a novel and practical targeted attack scenario in which the adversary knows only the vision encoder of the victim LVLM, with no knowledge of its prompts (which are often proprietary to service providers and not publicly available) or its underlying large language model (LLM). This practical setting challenges the cross-prompt and cross-model transferability of targeted adversarial attacks, which aim to mislead the LVLM into outputting a response that is semantically similar to the attacker's chosen target text. To this end, we propose an instruction-tuned targeted attack (dubbed \textsc{InstructTA}) that delivers targeted adversarial attacks on LVLMs with high transferability. First, we use a public text-to-image generative model to "reverse" the target response into a target image, and employ GPT-4 to infer a reasonable instruction $\boldsymbol{p}^\prime$ from the target response. We then build a local surrogate model (sharing the same vision encoder as the victim LVLM) to extract instruction-aware features of the adversarial image example and the target image, and optimize the adversarial example by minimizing the distance between these two features. To further improve transferability through instruction tuning, we augment the instruction $\boldsymbol{p}^\prime$ with paraphrases of it generated by GPT-4. Extensive experiments demonstrate the superiority of our proposed method in targeted attack performance and transferability. The code is available at https://github.com/xunguangwang/InstructTA.
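The feature-matching step described above can be illustrated with a toy sketch. It is not the released InstructTA implementation, and almost everything in it is an assumption made for illustration: the surrogate is replaced by small random linear maps (one per instruction, standing in for the original instruction and its paraphrases), the distance is squared Euclidean, and the perturbation is kept within an $L_\infty$ budget `eps` using signed-gradient steps. The abstract specifies none of these choices, and the helper names (`toy_encoders`, `loss_and_grad`, `pgd_attack`) are hypothetical.

```python
import random


def matvec(W, v):
    """Return W @ v for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def matTvec(W, v):
    """Return W.T @ v for a list-of-rows matrix W."""
    return [sum(W[i][j] * v[i] for i in range(len(W))) for j in range(len(W[0]))]


def toy_encoders(n_instructions, dim_in, dim_out, seed=0):
    """Hypothetical stand-in for an instruction-aware surrogate: one linear map
    per instruction, built as a shared matrix plus instruction-specific noise."""
    rng = random.Random(seed)
    shared = [[rng.gauss(0, 1) for _ in range(dim_in)] for _ in range(dim_out)]
    return [
        [[w + 0.3 * rng.gauss(0, 1) for w in row] for row in shared]
        for _ in range(n_instructions)
    ]


def loss_and_grad(encoders, x_adv, x_tgt):
    """Mean squared feature distance over all instructions, plus its gradient
    with respect to x_adv (closed form, because the toy encoders are linear)."""
    diff = [a - t for a, t in zip(x_adv, x_tgt)]
    loss, grad = 0.0, [0.0] * len(x_adv)
    for W in encoders:
        r = matvec(W, diff)  # feature(x_adv) - feature(x_tgt) under this instruction
        loss += sum(v * v for v in r)
        grad = [g + 2.0 * b for g, b in zip(grad, matTvec(W, r))]
    n = len(encoders)
    return loss / n, [g / n for g in grad]


def pgd_attack(encoders, x_clean, x_tgt, eps=8 / 255, step=1 / 255, iters=50):
    """Signed-gradient descent on the feature distance (assumed optimizer),
    projected onto the eps-ball around x_clean and the valid pixel range [0, 1]."""
    x_adv = list(x_clean)
    best_x, best_loss = list(x_adv), loss_and_grad(encoders, x_adv, x_tgt)[0]
    for _ in range(iters):
        _, grad = loss_and_grad(encoders, x_adv, x_tgt)
        stepped = [x - step * ((g > 0) - (g < 0)) for x, g in zip(x_adv, grad)]
        x_adv = [
            min(max(s, c - eps, 0.0), c + eps, 1.0)
            for s, c in zip(stepped, x_clean)
        ]
        loss = loss_and_grad(encoders, x_adv, x_tgt)[0]
        if loss < best_loss:  # keep the best iterate seen so far
            best_x, best_loss = list(x_adv), loss
    return best_x, best_loss


# Toy usage: 16-"pixel" images and 3 instructions (original plus two paraphrases).
rng = random.Random(1)
x_clean = [rng.random() for _ in range(16)]
x_tgt = [rng.random() for _ in range(16)]
encoders = toy_encoders(n_instructions=3, dim_in=16, dim_out=8)
loss_before, _ = loss_and_grad(encoders, x_clean, x_tgt)
x_adv, loss_after = pgd_attack(encoders, x_clean, x_tgt)
```

Averaging the loss over several instruction-specific encoders mirrors the idea of augmenting $\boldsymbol{p}^\prime$ with paraphrases: the perturbation is pushed toward the target features under every instruction rather than a single one. In a real setting the linear maps would be replaced by the surrogate model's feature extractor and the gradient obtained by automatic differentiation.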