We demonstrate how images and sounds can be used for indirect prompt and instruction injection in multi-modal LLMs. An attacker generates an adversarial perturbation corresponding to a chosen prompt and blends it into an image or audio recording. When the user asks the (unmodified, benign) model about the perturbed image or audio, the perturbation steers the model to output the attacker-chosen text and/or makes the subsequent dialog follow the attacker's instruction. We illustrate this attack with several proof-of-concept examples targeting LLaVa and PandaGPT.
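As a rough illustration of what "generating a perturbation that steers the output" can mean, the sketch below runs projected sign-gradient descent against a toy linear classifier. It is not the procedure used against LLaVa or PandaGPT: the linear "model", the 10-entry vocabulary, the 64x64 input size, and the `eps`, `step`, and `iters` values are all assumptions chosen only to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a multi-modal model: a single linear layer that
# maps a flattened 64x64 "image" to logits over a 10-token vocabulary.
NUM_PIXELS, VOCAB = 64 * 64, 10
W = rng.normal(size=(VOCAB, NUM_PIXELS))

def logits(x):
    return W @ x

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def perturb_toward(x, target, eps=0.1, step=0.01, iters=50):
    """Search for delta with max |delta| <= eps that pushes the toy model to `target`."""
    delta = np.zeros_like(x)
    onehot = np.eye(VOCAB)[target]
    for _ in range(iters):
        p = softmax(logits(x + delta))
        # Gradient of cross-entropy(target) w.r.t. the input of a linear model.
        grad = W.T @ (p - onehot)
        delta = delta - step * np.sign(grad)       # signed descent step
        delta = np.clip(delta, -eps, eps)          # stay inside the eps-ball
        delta = np.clip(x + delta, 0.0, 1.0) - x   # keep pixel values in [0, 1]
    return delta

x = rng.uniform(size=NUM_PIXELS)                   # the "benign" input
clean_pred = int(np.argmax(logits(x)))
target = (clean_pred + 1) % VOCAB                  # attacker-chosen output token
delta = perturb_toward(x, target)
adv_pred = int(np.argmax(logits(x + delta)))

print("clean prediction:", clean_pred)
print("target token:    ", target)
print("after perturbing:", adv_pred)
print("max |delta|:     ", float(np.abs(delta).max()))
```

In a real multi-modal LLM the gradient would be taken through the full model with respect to the image or audio input, and the loss would cover an entire target token sequence rather than a single class; the toy above only shows the shape of the optimization loop.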