Backdoor attacks are commonly executed by contaminating training data so that a trigger can activate predetermined harmful effects at test time. In this work, we present AnyDoor, a test-time backdoor attack against multimodal large language models (MLLMs). AnyDoor injects the backdoor into the textual modality using adversarial test images that share the same universal perturbation, without requiring access to or modification of the training data. It employs techniques similar to those used in universal adversarial attacks, but differs in that it decouples the setup of harmful effects from their activation in time. In our experiments, we validate the effectiveness of AnyDoor against popular MLLMs such as LLaVA-1.5, MiniGPT-4, InstructBLIP, and BLIP-2, and provide comprehensive ablation studies. Notably, because the backdoor is injected by a universal perturbation, AnyDoor can dynamically change its backdoor trigger prompts and harmful effects, exposing a new challenge for defending against backdoor attacks. Our project page is available at https://sail-sg.github.io/AnyDoor/.
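The separation between setup (a shared image perturbation) and activation (a textual trigger) can be illustrated with a minimal toy sketch. This is not the AnyDoor implementation. It assumes a hypothetical linear scoring function `score` with made-up weights `A`, `C`, and `B` standing in for an MLLM, a boolean flag standing in for the trigger prompt, and an illustrative two-term loss: raise the target score when the trigger is present, and keep the score unchanged when it is absent. All names and hyperparameters are assumptions.

```python
import math

# Hypothetical toy "model" (an assumption, not a real MLLM):
# score(x, trigger) = A.x + trigger * (C.x) + B, and score > 0 means
# "the model emits the attacker's target output".
A = [1.0, -1.0, 0.0, 0.0]   # image response that is always active
C = [0.0, 0.0, 8.0, -8.0]   # extra image response active only with the trigger
B = -1.0

def score(x, trigger):
    s = B
    for a, c, xi in zip(A, C, x):
        s += a * xi + (c * xi if trigger else 0.0)
    return s

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def add(x, delta):
    # Pixel-range clipping is omitted for brevity.
    return [xi + di for xi, di in zip(x, delta)]

def fit_universal_perturbation(images, eps=0.1, lr=0.01, steps=200, lam=5.0):
    """Projected gradient descent on ONE perturbation shared by all images.

    Illustrative loss: mean_i softplus(-score(x_i + delta, trigger=True))
                       + lam * (A . delta)^2
    The second term is the squared no-trigger score shift, which for this
    linear toy equals (A . delta)^2 for every image.
    """
    delta = [0.0] * len(A)
    for _ in range(steps):
        # Gradient weight of the with-trigger term, averaged over images.
        s = sum(sigmoid(-score(add(x, delta), True)) for x in images) / len(images)
        # Current no-trigger score shift caused by the perturbation.
        drift = sum(a * d for a, d in zip(A, delta))
        for j in range(len(delta)):
            grad = -s * (A[j] + C[j]) + 2.0 * lam * drift * A[j]
            # Gradient step, then projection onto the L-infinity ball of radius eps.
            delta[j] = max(-eps, min(eps, delta[j] - lr * grad))
    return delta

# Usage: three made-up "images" with four pixels each.
IMAGES = [[0.2, 0.6, 0.5, 0.5], [0.9, 0.4, 0.3, 0.35], [0.1, 0.1, 0.8, 0.75]]
DELTA = fit_universal_perturbation(IMAGES)
for x in IMAGES:
    xp = add(x, DELTA)
    print(score(x, True) > 0, score(xp, True) > 0, score(xp, False) > 0)
```

In this toy, clean images never cross the target threshold, and perturbed images cross it only when the trigger flag is set. That is the sense in which the perturbation sets up the backdoor and the trigger activates it later. A real attack would replace `score` with gradients through an actual model, which this sketch does not attempt.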