Large Vision-Language Models (LVLMs) are susceptible to typographic attacks: misclassifications induced by attack text added to an image. In this paper, we introduce a multi-image setting for studying typographic attacks, broadening the literature's current emphasis on attacking individual images. Specifically, we focus on attacking image sets without repeating the attack text. Such non-repeating attacks are stealthier, as they are more likely to evade a gatekeeper than attacks that reuse the same attack text. We introduce two attack strategies for the multi-image setting, leveraging the difficulty of the target image, the strength of the attack text, and text-image similarity. Our text-image similarity approach improves attack success rates by 21% over random, non-specific methods on the CLIP model with ImageNet while maintaining stealth in the multi-image scenario. An additional experiment demonstrates transferability: text-image similarity computed with CLIP transfers when attacking InstructBLIP.
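As a minimal sketch of the similarity-guided, non-repeating assignment described above: each image greedily receives the attack text with the highest cosine similarity to it, and no text is used twice across the set. The embeddings here are illustrative stand-ins; in practice they would come from a CLIP image/text encoder, and the greedy rule is one simple realization, not necessarily the paper's exact procedure.

```python
import numpy as np

def assign_attack_texts(image_embs: np.ndarray, text_embs: np.ndarray) -> list:
    """Greedily assign each image a distinct attack text, preferring the
    text with the highest cosine similarity to that image.
    Returns a list where entry i is the chosen text index for image i."""
    # Normalize rows so that dot products equal cosine similarities.
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txts = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = imgs @ txts.T  # shape: (n_images, n_texts)

    used, assignment = set(), []
    for i in range(sims.shape[0]):
        # Scan candidate texts from most to least similar,
        # skipping texts already assigned (non-repeating constraint).
        for j in np.argsort(-sims[i]):
            if int(j) not in used:
                used.add(int(j))
                assignment.append(int(j))
                break
    return assignment

# Hypothetical 2-D embeddings: three images, three candidate attack texts.
images = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
texts = np.array([[1.0, 0.1], [0.2, 1.0], [0.0, 1.0]])
print(assign_attack_texts(images, texts))  # e.g. [0, 2, 1]
```

A greedy pass is order-dependent; a globally optimal pairing could instead be found with the Hungarian algorithm (e.g. `scipy.optimize.linear_sum_assignment`) on the negated similarity matrix.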