We demonstrate how images and sounds can be used for indirect prompt and instruction injection in multi-modal LLMs. An attacker generates an adversarial perturbation corresponding to the prompt and blends it into an image or audio recording. When the user asks the (unmodified, benign) model about the perturbed image or audio, the perturbation steers the model to output the attacker-chosen text and/or make the subsequent dialog follow the attacker's instruction. We illustrate this attack with several proof-of-concept examples targeting LLaVa and PandaGPT.
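One way to read "an adversarial perturbation corresponding to the prompt" is as the solution of a targeted optimization problem over the input, with the model weights held fixed. The formulation below is only a sketch under assumed notation; none of the symbols appear in the text above. Here $x$ is the benign image or audio input, $q$ the user's text query, $y^{*} = (y^{*}_{1}, \dots, y^{*}_{m})$ the attacker-chosen token sequence, $p_{\theta}$ the unmodified model's next-token distribution, and $\epsilon$ a hypothetical bound on perturbation size that the attacker may or may not impose.

```latex
% Sketch under assumed notation: the perturbation \delta is chosen so that the
% frozen model p_\theta assigns high likelihood to the attacker's target text y^*.
\[
  \delta^{\star}
  \;=\;
  \arg\min_{\delta}\;
  \sum_{t=1}^{m}
  -\log p_{\theta}\!\left( y^{*}_{t} \,\middle|\, x + \delta,\; q,\; y^{*}_{<t} \right)
  \qquad
  \text{optionally subject to } \lVert \delta \rVert_{\infty} \le \epsilon .
\]
```

Under this reading only $\delta$ is optimized while $\theta$ stays untouched, which matches the statement that the model itself is unmodified and benign.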