InstructTA: Instruction-Tuned Targeted Attack for Large Vision-Language Models

Large vision-language models (LVLMs) have demonstrated remarkable capability in image understanding and response generation. However, this rich visual interaction also makes LVLMs vulnerable to adversarial examples. In this paper, we formulate a novel and practical gray-box attack scenario in which the adversary can access only the visual encoder of the victim LVLM, with no knowledge of its prompts (which are often proprietary to service providers and not publicly available) or its underlying large language model (LLM). This practical setting challenges the cross-prompt and cross-model transferability of targeted adversarial attacks, which aim to induce the LVLM to output a response that is semantically similar to the attacker's chosen target text. To this end, we propose an instruction-tuned targeted attack (dubbed InstructTA) to deliver targeted adversarial attacks on LVLMs with high transferability. First, we use a public text-to-image generative model to "reverse" the target response into a target image, and employ GPT-4 to infer a reasonable instruction $\boldsymbol{p}^\prime$ from the target response. We then form a local surrogate model (sharing the same visual encoder as the victim LVLM) to extract instruction-aware features of an adversarial image example and the target image, and optimize the adversarial example by minimizing the distance between these two features. To further improve transferability, we augment the instruction $\boldsymbol{p}^\prime$ with paraphrases of it generated by an LLM. Extensive experiments demonstrate the superiority of our proposed method in targeted attack performance and transferability.
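The feature-matching step described in the abstract can be illustrated with a small toy sketch. Everything in it is an assumption made for illustration and not the authors' implementation: a random linear map `W` stands in for the shared visual encoder, per-instruction gate vectors stand in for the instruction-aware features under $\boldsymbol{p}^\prime$ and its paraphrases, a random vector stands in for the generated target image, and the optimizer is a PGD-style signed-gradient update under an L-infinity budget `eps` (the abstract does not specify the optimizer or the perturbation norm).

```python
import random

def features(W, gate, x):
    """Toy 'instruction-aware' features: linear encoder output, gated per instruction."""
    return [g * sum(w * xi for w, xi in zip(row, x)) for row, g in zip(W, gate)]

def loss_and_grad(W, gates, x_adv, x_tgt):
    """Squared feature distance averaged over instructions, and its gradient w.r.t. x_adv."""
    n = len(x_adv)
    loss, grad = 0.0, [0.0] * n
    for gate in gates:
        fa, ft = features(W, gate, x_adv), features(W, gate, x_tgt)
        diff = [a - t for a, t in zip(fa, ft)]
        loss += sum(d * d for d in diff) / len(gates)
        for row, g, d in zip(W, gate, diff):
            for j in range(n):
                grad[j] += 2.0 * d * g * row[j] / len(gates)
    return loss, grad

def targeted_attack(W, gates, x_clean, x_tgt, eps=0.1, step=0.01, iters=100):
    """Signed-gradient descent on the feature distance inside an L-inf ball (assumed setup)."""
    x_adv = list(x_clean)
    best, best_loss = list(x_clean), float("inf")
    for _ in range(iters + 1):
        loss, grad = loss_and_grad(W, gates, x_adv, x_tgt)
        if loss < best_loss:  # keep the best iterate seen so far
            best, best_loss = list(x_adv), loss
        for j, g in enumerate(grad):
            sign = (g > 0) - (g < 0)
            v = x_adv[j] - step * sign
            v = min(max(v, x_clean[j] - eps), x_clean[j] + eps)  # project onto the eps-ball
            x_adv[j] = min(max(v, 0.0), 1.0)  # stay in the valid pixel range
    return best

rng = random.Random(0)
dim, feat = 8, 4  # hypothetical toy sizes
W = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(feat)]
# Three gates: stand-ins for the inferred instruction p' and two paraphrases.
gates = [[rng.uniform(0.5, 1.5) for _ in range(feat)] for _ in range(3)]
x_clean = [rng.random() for _ in range(dim)]  # stand-in for the clean image
x_tgt = [rng.random() for _ in range(dim)]    # stand-in for the generated target image

x_adv = targeted_attack(W, gates, x_clean, x_tgt)
loss_before, _ = loss_and_grad(W, gates, x_clean, x_tgt)
loss_after, _ = loss_and_grad(W, gates, x_adv, x_tgt)
```

Averaging the loss over several gate vectors mirrors the idea of augmenting the instruction with paraphrases: the perturbation is pushed to match the target features under every instruction variant, not just one. In a real setting the toy encoder would be replaced by the surrogate model's differentiable feature extractor and the gradient would come from automatic differentiation.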

