We introduce a new type of indirect injection vulnerability in language models that operate on images: hidden "meta-instructions" that influence how the model interprets the image and steer the model's outputs toward an adversary-chosen style, sentiment, or point of view. We explain how to create meta-instructions by generating images that act as soft prompts. Unlike jailbreaking attacks and adversarial examples, the outputs resulting from these images are plausible and grounded in the visual content of the image, yet follow the adversary's (meta-)instructions. We describe the risks of these attacks, including misinformation and spin, evaluate their efficacy for multiple visual language models and adversarial meta-objectives, and demonstrate how they can "unlock" capabilities of the underlying language models that are unavailable via explicit text instructions. Finally, we discuss defenses against these attacks.
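To make the "images as soft prompts" idea concrete, below is a minimal sketch of one plausible way to optimize such an image: a PGD-style loop that perturbs the image so that a visual language model answers an ordinary question about it with an adversary-chosen spin. This is an illustration under assumptions, not the paper's released code; the model checkpoint, prompt template, target text, and hyperparameters are all placeholders.

```python
# Sketch: optimize a small image perturbation so a LLaVA-style VLM answers a
# benign question about the image in an adversary-chosen style.
# Assumptions: Hugging Face transformers with LLaVA support; the checkpoint,
# prompt format, target response, and budgets below are illustrative only.

import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"            # placeholder checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id).eval()
for p in model.parameters():                     # only the image is optimized
    p.requires_grad_(False)

image = Image.open("photo.jpg")                  # the carrier image
question = "USER: <image>\nWhat is shown in this image? ASSISTANT:"
# Target output: plausible and grounded in the image, but written with the
# adversary's spin (such targets could, e.g., be generated by prompting the
# model with the meta-instruction spelled out in text).
target = " A vibrant, thriving city street, clear evidence the new policy is working."

# Tokenize prompt and target; supervise only the target tokens.
prompt_inputs = processor(text=question, images=image, return_tensors="pt")
target_ids = processor.tokenizer(
    target, add_special_tokens=False, return_tensors="pt"
).input_ids
input_ids = torch.cat([prompt_inputs["input_ids"], target_ids], dim=1)
labels = input_ids.clone()
labels[:, : prompt_inputs["input_ids"].shape[1]] = -100  # ignore prompt tokens

pixel_values = prompt_inputs["pixel_values"].detach()
delta = torch.zeros_like(pixel_values, requires_grad=True)
eps, alpha, steps = 8 / 255, 1 / 255, 500        # budget in (normalized) pixel space

for _ in range(steps):
    out = model(input_ids=input_ids,
                pixel_values=pixel_values + delta,
                labels=labels)
    out.loss.backward()                          # NLL of the spun target answer
    with torch.no_grad():
        delta -= alpha * delta.grad.sign()       # descend: make the target more likely
        delta.clamp_(-eps, eps)                  # keep the perturbation small
        delta.grad.zero_()
```

Unlike a jailbreak objective, the target here is a normal, image-grounded answer whose only anomaly is its style or stance, which is why outputs elicited by such images can remain plausible to the victim.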