Visual Language Models (VLMs) are vulnerable to adversarial attacks, especially those launched via adversarial images, a threat that is, however, under-explored in the literature. To facilitate research on this critical safety problem, we first construct a new laRge-scale Adversarial images dataset with Diverse hArmful Responses (RADAR), given that existing datasets are either small-scale or contain only limited types of harmful responses. With the new RADAR dataset, we further develop a novel and effective iN-time Embedding-based AdveRSarial Image DEtection (NEARSIDE) method, which exploits a single vector distilled from the hidden states of VLMs, termed the attacking direction, to distinguish adversarial images from benign ones in the input. Extensive experiments with two victim VLMs, LLaVA and MiniGPT-4, demonstrate the effectiveness, efficiency, and cross-model transferability of our proposed method. Our code is available at https://github.com/mob-scu/RADAR-NEARSIDE
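As a rough illustration of the embedding-based detection idea described above, the following minimal sketch assumes the attacking direction is obtained as a difference-of-means vector between hidden states of adversarial and benign inputs, and that detection reduces to thresholding a projection onto that direction. Function names (`distill_attack_direction`, `detect_adversarial`) and the difference-of-means recipe are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def distill_attack_direction(adv_states, benign_states):
    """Illustrative: distill a single 'attacking direction' as the mean
    difference between adversarial and benign hidden-state embeddings.
    (An assumed recipe; see the paper for the actual distillation.)"""
    return np.mean(adv_states, axis=0) - np.mean(benign_states, axis=0)

def detect_adversarial(hidden_state, attack_direction, threshold=0.0):
    """Flag an input as adversarial if its hidden-state embedding projects
    beyond a threshold onto the (normalized) attacking direction."""
    direction = attack_direction / np.linalg.norm(attack_direction)
    score = float(np.dot(hidden_state, direction))
    return score > threshold
```

Detection at inference time then costs only one dot product per input, which is consistent with the "in-time" (efficient, single-pass) framing of the method.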