Large vision-language models (LVLMs) have demonstrated impressive capabilities in image understanding and response generation. However, this rich visual interaction also makes LVLMs vulnerable to adversarial examples. In this paper, we formulate a novel and practical gray-box attack scenario in which the adversary can only access the visual encoder of the victim LVLM, without knowledge of its prompts (which are often proprietary to service providers and not publicly available) or its underlying large language model (LLM). This practical setting poses challenges to the cross-prompt and cross-model transferability of targeted adversarial attacks, which aim to mislead the LVLM into outputting a response that is semantically similar to the attacker's chosen target text. To this end, we propose an instruction-tuned targeted attack (dubbed InstructTA) that delivers targeted adversarial attacks on LVLMs with high transferability. First, we use a public text-to-image generative model to "reverse" the target response into a target image, and employ GPT-4 to infer a reasonable instruction $\boldsymbol{p}^\prime$ from the target response. We then form a local surrogate model (which shares its visual encoder with the victim LVLM) to extract instruction-aware features of the adversarial example and the target image, and minimize the distance between these two feature representations to optimize the adversarial example. To further improve transferability, we augment the instruction $\boldsymbol{p}^\prime$ with instructions paraphrased by an LLM. Extensive experiments demonstrate the superiority of our proposed method in targeted attack performance and transferability.
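To make the optimization step concrete, the following is a minimal sketch of the feature-matching objective described above: a PGD-style loop that perturbs the adversarial image so that its instruction-aware features, under the surrogate model, approach those of the generated target image, averaged over a pool of paraphrased instructions. The `surrogate(image, instruction)` callable, the `instructions` list, and the hyperparameters (`eps`, `alpha`, `steps`, an $L_\infty$ budget) are illustrative assumptions, not the authors' released implementation.

```python
import torch

def instructta_attack(surrogate, clean_image, target_image, instructions,
                      eps=8 / 255, alpha=1 / 255, steps=100):
    """Sketch of the InstructTA objective: minimize the expected distance
    between instruction-aware features of the adversarial image and the
    target image over LLM-paraphrased instructions.

    Assumptions (hypothetical, for illustration only):
      - surrogate(image, instruction) returns a feature tensor,
      - images are float tensors in [0, 1],
      - an L_inf perturbation budget with a sign-gradient update.
    """
    adv = clean_image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = torch.zeros((), device=adv.device)
        for p in instructions:  # augmentation with paraphrased instructions
            f_adv = surrogate(adv, p)
            with torch.no_grad():  # target features need no gradient
                f_tgt = surrogate(target_image, p)
            loss = loss + (f_adv - f_tgt).norm()
        loss = loss / len(instructions)
        (grad,) = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv = adv - alpha * grad.sign()  # descend on feature distance
            # project back into the L_inf ball and the valid pixel range
            adv = clean_image + (adv - clean_image).clamp(-eps, eps)
            adv = adv.clamp(0.0, 1.0)
        adv = adv.detach()
    return adv
```

Averaging the loss over several paraphrased instructions, rather than optimizing against a single $\boldsymbol{p}^\prime$, is what the abstract credits for the improved cross-prompt transferability: the perturbation cannot overfit to one specific instruction's feature conditioning.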