Due to their multimodal capabilities, Vision-Language Models (VLMs) have found numerous impactful applications in real-world scenarios. However, recent studies have revealed that VLMs are vulnerable to image-based adversarial attacks, particularly targeted adversarial images that manipulate the model into generating harmful content specified by the adversary. Current attack methods rely on predefined target labels to create targeted adversarial attacks, which limits their scalability and applicability in large-scale robustness evaluations. In this paper, we propose AnyAttack, a self-supervised framework that generates targeted adversarial images for VLMs without label supervision, allowing any image to serve as the target of an attack. Our framework follows a pre-training and fine-tuning paradigm, with the adversarial noise generator pre-trained on the large-scale LAION-400M dataset; this pre-training endows our method with strong transferability across a wide range of VLMs. Extensive experiments on five mainstream open-source VLMs (CLIP, BLIP, BLIP2, InstructBLIP, and MiniGPT-4) across three multimodal tasks (image-text retrieval, multimodal classification, and image captioning) demonstrate the effectiveness of our attack. Additionally, we successfully transfer AnyAttack to multiple commercial VLMs, including Google Gemini, Claude Sonnet, Microsoft Copilot, and OpenAI GPT. These results reveal an unprecedented risk to VLMs and highlight the need for effective countermeasures.
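To make the self-supervised objective concrete, below is a minimal PyTorch sketch of the idea the abstract describes: a noise generator is trained so that a clean image plus generated noise embeds like an arbitrary target image under a frozen image encoder, so the target image supervises itself and no labels are needed. This is an illustrative sketch under stated assumptions, not the authors' released implementation: `NoiseGenerator`, `EPS`, the loss form, and the stand-in encoder are all hypothetical placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EPS = 8 / 255  # assumed L-infinity perturbation budget (not specified in the abstract)

class NoiseGenerator(nn.Module):
    """Illustrative generator: maps a target image to bounded adversarial noise."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def forward(self, target_img):
        # tanh keeps the noise inside the L-infinity ball of radius EPS
        return EPS * torch.tanh(self.net(target_img))

def self_supervised_loss(generator, image_encoder, clean, target):
    """Train the generator so that clean + noise embeds like the target image.
    No labels are needed: the target image acts as its own supervision."""
    adv = (clean + generator(target)).clamp(0.0, 1.0)
    with torch.no_grad():  # the encoder stays frozen; only the generator learns
        tgt_emb = F.normalize(image_encoder(target), dim=-1)
    adv_emb = F.normalize(image_encoder(adv), dim=-1)
    return (1.0 - (adv_emb * tgt_emb).sum(dim=-1)).mean()  # cosine distance

# Toy usage with a stand-in encoder; in practice this would be a frozen
# CLIP-style image encoder, with training batches drawn from LAION-400M.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 512)).eval()
generator = NoiseGenerator()
clean, target = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
loss = self_supervised_loss(generator, encoder, clean, target)
loss.backward()
```

Because the loss is defined purely between image embeddings, any image can play the role of the target, which is what removes the dependence on predefined target labels; fine-tuning the pre-trained generator on a downstream encoder would then follow the same objective.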