Large vision-language models (LVLMs) have demonstrated remarkable capabilities in image understanding and response generation. However, this rich visual interaction also makes LVLMs vulnerable to adversarial examples. In this paper, we formulate a novel and practical gray-box attack scenario in which the adversary can access only the visual encoder of the victim LVLM, with no knowledge of its prompts (which are often proprietary to service providers and not publicly available) or its underlying large language model (LLM). This practical setting challenges the cross-prompt and cross-model transferability of targeted adversarial attacks, which aim to mislead the LVLM into outputting a response that is semantically similar to the attacker's chosen target text. To this end, we propose an instruction-tuned targeted attack (dubbed InstructTA) to deliver targeted adversarial attacks on LVLMs with high transferability. First, we use a public text-to-image generative model to "reverse" the target response into a target image, and employ GPT-4 to infer a reasonable instruction $\boldsymbol{p}^\prime$ from the target response. We then form a local surrogate model (sharing the same visual encoder as the victim LVLM) to extract instruction-aware features of the adversarial image example and the target image, and optimize the adversarial example by minimizing the distance between these two features. To further improve transferability, we augment the instruction $\boldsymbol{p}^\prime$ with paraphrases of it generated by an LLM. Extensive experiments demonstrate the superiority of our proposed method in targeted attack performance and transferability.
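
To make the feature-matching step concrete, the following is a minimal sketch under strong simplifying assumptions; it is not the authors' implementation. The surrogate model is replaced by a toy linear encoder `W`, each instruction (the inferred $\boldsymbol{p}^\prime$ and its paraphrases) is represented by a hypothetical gating vector, the feature distance is assumed to be squared L2, and the adversarial example is assumed to be optimized with signed-gradient steps under an L-infinity budget `eps`. All names and hyperparameters are illustrative.

```python
import random

def encode(W, gate, x):
    """Toy instruction-aware feature: gate_i * (W x)_i. Stand-in for a real surrogate."""
    return [g * sum(w * xi for w, xi in zip(row, x)) for g, row in zip(gate, W)]

def feature_loss(W, gates, x_adv, x_tar):
    """Squared feature distance between x_adv and x_tar, averaged over instructions."""
    total = 0.0
    for gate in gates:
        fa, ft = encode(W, gate, x_adv), encode(W, gate, x_tar)
        total += sum((a - t) ** 2 for a, t in zip(fa, ft))
    return total / len(gates)

def loss_grad(W, gates, x_adv, x_tar):
    """Analytic gradient of feature_loss w.r.t. x_adv (valid because encode is linear)."""
    grad = [0.0] * len(x_adv)
    for gate in gates:
        fa, ft = encode(W, gate, x_adv), encode(W, gate, x_tar)
        for g, row, a, t in zip(gate, W, fa, ft):
            coeff = 2.0 * g * (a - t) / len(gates)
            for j, w in enumerate(row):
                grad[j] += coeff * w
    return grad

def attack(W, gates, x_clean, x_tar, eps=0.1, step=0.01, iters=100):
    """Signed-gradient descent on the feature distance, projected onto the eps-ball."""
    x_adv = list(x_clean)
    best_x, best_loss = list(x_adv), feature_loss(W, gates, x_adv, x_tar)
    for _ in range(iters):
        grad = loss_grad(W, gates, x_adv, x_tar)
        for j, gj in enumerate(grad):
            sign = (gj > 0) - (gj < 0)
            x_new = x_adv[j] - step * sign
            # Keep the perturbation within eps of the clean pixel ...
            x_new = min(max(x_new, x_clean[j] - eps), x_clean[j] + eps)
            # ... and keep the pixel in the valid [0, 1] range.
            x_adv[j] = min(max(x_new, 0.0), 1.0)
        current = feature_loss(W, gates, x_adv, x_tar)
        if current < best_loss:  # remember the best iterate seen so far
            best_x, best_loss = list(x_adv), current
    return best_x

# Toy data: 8 "pixels", 4 feature dimensions, 3 instructions (p' plus 2 paraphrases).
rng = random.Random(0)
dim, feat = 8, 4
W = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(feat)]
gates = [[rng.uniform(0.5, 1.5) for _ in range(feat)] for _ in range(3)]
x_clean = [rng.random() for _ in range(dim)]
x_tar = [rng.random() for _ in range(dim)]

x_adv = attack(W, gates, x_clean, x_tar)
print("feature distance before:", feature_loss(W, gates, x_clean, x_tar))
print("feature distance after: ", feature_loss(W, gates, x_adv, x_tar))
```

Averaging the loss over several gating vectors mirrors the idea of augmenting $\boldsymbol{p}^\prime$ with paraphrased instructions: the perturbation is pushed to match the target features under every instruction in the set rather than a single one. In a realistic setting the analytic gradient would be replaced by automatic differentiation through the surrogate model.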