Multi-modal foundation models like OpenFlamingo, LLaVA, and GPT-4 are increasingly used for various real-world tasks. Prior work has shown that these models are highly vulnerable to adversarial attacks on the vision modality. Such attacks can be leveraged to spread fake information or defraud users and thus pose a significant risk, which makes the robustness of large multi-modal foundation models a pressing problem. The CLIP model, or one of its variants, is used as a frozen vision encoder in many vision-language models (VLMs), e.g., LLaVA and OpenFlamingo. We propose an unsupervised adversarial fine-tuning scheme to obtain a robust CLIP vision encoder, which yields robustness on all downstream vision tasks (VLMs, zero-shot classification) that rely on CLIP. In particular, we show that stealth attacks on users of VLMs by a malicious third party providing manipulated images are no longer possible once the original CLIP model is replaced with our robust one. No retraining or fine-tuning of the VLM is required. The code and robust models are available at https://github.com/chs20/RobustVLM
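The abstract does not spell out the training objective, so the following is only a minimal sketch of what an unsupervised adversarial fine-tuning loop could look like. It assumes an embedding-matching objective: the fine-tuned encoder is trained so that its embedding of a perturbed image stays close to the frozen original encoder's embedding of the clean image, which needs no labels. The toy linear encoder, the function names (`pgd_attack`, `robust_finetune`), and all hyperparameters are hypothetical illustrations and are not taken from the RobustVLM code.

```python
import random

def matvec(W, v):
    # Toy "encoder": a linear map from pixels to an embedding.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def embedding_gap(W, W_orig, z, x):
    # Difference between the fine-tuned embedding of z and the
    # frozen original embedding of the clean image x.
    return [a - b for a, b in zip(matvec(W, z), matvec(W_orig, x))]

def pgd_attack(W, W_orig, x, eps, step, iters, rng):
    # Inner maximization (assumed): find z in the l_inf ball of radius eps
    # around x that maximizes the squared embedding gap.
    delta = [rng.uniform(-eps, eps) for _ in x]  # random start
    for _ in range(iters):
        z = [xi + di for xi, di in zip(x, delta)]
        r = embedding_gap(W, W_orig, z, x)
        # Gradient of ||r||^2 with respect to z is 2 * W^T r.
        grad = [2 * sum(W[i][j] * r[i] for i in range(len(W)))
                for j in range(len(x))]
        # Signed gradient ascent step, then projection back onto the ball.
        delta = [max(-eps, min(eps, d + step * ((g > 0) - (g < 0))))
                 for d, g in zip(delta, grad)]
    return [xi + di for xi, di in zip(x, delta)]

def robust_finetune(W_orig, images, eps=0.1, lr=0.05, epochs=20, seed=0):
    # Outer minimization (assumed): SGD on the encoder weights so that the
    # embedding of the adversarial image matches the original clean embedding.
    rng = random.Random(seed)
    W = [row[:] for row in W_orig]  # start from the original weights
    for _ in range(epochs):
        for x in images:
            z = pgd_attack(W, W_orig, x, eps, step=eps / 4, iters=5, rng=rng)
            r = embedding_gap(W, W_orig, z, x)
            # Gradient of ||r||^2 with respect to W[i][j] is 2 * r[i] * z[j].
            for i in range(len(W)):
                for j in range(len(x)):
                    W[i][j] -= lr * 2 * r[i] * z[j]
    return W

# Hypothetical toy data: a 3x4 "encoder" and two unlabeled 4-pixel "images".
W_orig = [[0.5, -0.2, 0.1, 0.3],
          [0.0, 0.4, -0.6, 0.2],
          [0.7, 0.1, 0.0, -0.5]]
images = [[0.1, 0.9, 0.3, 0.5], [0.8, 0.2, 0.6, 0.4]]
W_robust = robust_finetune(W_orig, images)
```

Because the target embeddings come from the frozen original encoder, a robust encoder trained this way would stay in the same embedding space, which is consistent with the claim that it can replace the original CLIP encoder in a VLM without retraining. A real implementation would use a deep encoder with automatic differentiation and would also clip perturbed images to the valid pixel range, which this toy omits.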