InstructTA: Instruction-Tuned Targeted Attack for Large Vision-Language Models

Large vision-language models (LVLMs) have demonstrated remarkable capability in image understanding and response generation. However, this rich visual interaction also makes LVLMs vulnerable to adversarial examples. In this paper, we formulate a novel and practical targeted attack scenario in which the adversary knows only the vision encoder of the victim LVLM, with no knowledge of its prompts (which are often proprietary for service providers and not publicly available) or of its underlying large language model (LLM). This practical setting challenges the cross-prompt and cross-model transferability of targeted adversarial attacks, which aim to mislead the LVLM into outputting a response that is semantically similar to the attacker's chosen target text. To this end, we propose an instruction-tuned targeted attack (dubbed \textsc{InstructTA}) that delivers targeted adversarial attacks on LVLMs with high transferability. First, we use a public text-to-image generative model to "reverse" the target response into a target image, and employ GPT-4 to infer a reasonable instruction $\boldsymbol{p}^\prime$ from the target response. We then form a local surrogate model (sharing the same vision encoder as the victim LVLM) to extract instruction-aware features of both the adversarial image example and the target image, and optimize the adversarial example by minimizing the distance between these two features. To further improve transferability through instruction tuning, we augment the instruction $\boldsymbol{p}^\prime$ with paraphrases of it generated by GPT-4. Extensive experiments demonstrate the superiority of our proposed method in targeted attack performance and transferability. The code is available at https://github.com/xunguangwang/InstructTA.
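
The feature-matching step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that). It replaces the surrogate LVLM with a toy linear projection `W` gated by an instruction embedding through `U`, and every name here (`features`, `feature_matching_attack`, `mean_feature_distance`) as well as the budget `eps`, the step size `alpha`, and the choice to average the loss over paraphrased instructions are assumptions made for illustration only.

```python
import numpy as np


def features(W, U, x, p):
    """Toy 'instruction-aware' features (hypothetical stand-in for a surrogate model).

    The image projection W @ x is gated elementwise by the instruction embedding p.
    """
    gate = 1.0 + np.tanh(U @ p)
    return gate * (W @ x)


def mean_feature_distance(W, U, x, x_target, instructions):
    """Squared L2 feature distance to the target image, averaged over instructions."""
    dists = [
        np.sum((features(W, U, x, p) - features(W, U, x_target, p)) ** 2)
        for p in instructions
    ]
    return float(np.mean(dists))


def feature_matching_attack(W, U, x_clean, x_target, instructions,
                            eps=8 / 255, alpha=1 / 255, steps=100):
    """Projected sign-gradient descent on the averaged feature distance.

    x_adv stays within an L-infinity ball of radius eps around x_clean
    and within the valid pixel range [0, 1].
    """
    x_adv = x_clean.copy()
    for _ in range(steps):
        grad = np.zeros_like(x_adv)
        for p in instructions:
            gate = 1.0 + np.tanh(U @ p)
            diff = gate * (W @ (x_adv - x_target))  # f(x_adv, p) - f(x_target, p)
            grad += 2.0 * W.T @ (gate * diff)       # analytic gradient w.r.t. x_adv
        grad /= len(instructions)
        x_adv = x_adv - alpha * np.sign(grad)                  # descent step
        x_adv = np.clip(x_adv, x_clean - eps, x_clean + eps)   # perturbation budget
        x_adv = np.clip(x_adv, 0.0, 1.0)                       # valid pixel range
    return x_adv


# Demo with random data: flattened "images" and random instruction embeddings.
rng = np.random.default_rng(0)
dim_x, dim_f, dim_p = 64, 32, 8
W = rng.normal(size=(dim_f, dim_x)) / np.sqrt(dim_x)
U = rng.normal(size=(dim_f, dim_p))
x_clean = rng.uniform(size=dim_x)
x_target = rng.uniform(size=dim_x)   # stands in for the text-to-image output
p_prime = rng.normal(size=dim_p)     # stands in for the inferred instruction p'
# Small perturbations of p' stand in for paraphrased instructions.
paraphrases = [p_prime + 0.1 * rng.normal(size=dim_p) for _ in range(4)]
instructions = [p_prime] + paraphrases

x_adv = feature_matching_attack(W, U, x_clean, x_target, instructions)
print("distance before:", mean_feature_distance(W, U, x_clean, x_target, instructions))
print("distance after: ", mean_feature_distance(W, U, x_adv, x_target, instructions))
```

In a realistic setting the toy `features` function would be replaced by the shared vision encoder plus an instruction-aware feature extractor, and the gradient would come from automatic differentiation rather than the closed form used here.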

