Are foundation models secure from malicious actors? In this work, we focus on the image input to a vision-language model (VLM). We discover image hijacks, adversarial images that control generative models at runtime. We introduce Behavior Matching, a general method for creating image hijacks, and we use it to explore three types of attacks. Specific string attacks force the model to generate arbitrary output of the adversary's choosing. Leak context attacks leak information from the context window into the output. Jailbreak attacks circumvent a model's safety training. We study these attacks against LLaVA-2, a state-of-the-art VLM based on CLIP and LLaMA-2, and find that all our attack types have a success rate above 90%. Moreover, our attacks are automated and require only small image perturbations. These findings raise serious concerns about the security of foundation models. If image hijacks are as difficult to defend against as adversarial examples on CIFAR-10, then it might be many years before a solution is found, if one even exists.