A multimodal large language model (MLLM) agent can receive instructions, capture images, retrieve histories from memory, and decide which tools to use. However, red-teaming efforts have revealed that adversarial images or prompts can jailbreak an MLLM and cause unaligned behaviors. In this work, we report an even more severe safety issue in multi-agent environments, which we refer to as infectious jailbreak: the adversary jailbreaks only a single agent, and without any further intervention, (almost) all agents become infected exponentially fast and exhibit harmful behaviors. To validate the feasibility of infectious jailbreak, we simulate multi-agent environments containing up to one million LLaVA-1.5 agents and employ randomized pair-wise chat as a proof-of-concept instantiation of multi-agent interaction. Our results show that feeding an (infectious) adversarial image into the memory of any randomly chosen agent is sufficient to achieve infectious jailbreak. Finally, we derive a simple principle for determining whether a defense mechanism can provably restrain the spread of infectious jailbreak, but how to design a practical defense that meets this principle remains an open question. Our project page is available at https://sail-sg.github.io/Agent-Smith/.
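To give a rough intuition for why randomized pair-wise chat yields exponentially fast spread, the sketch below simulates memory-borne infection among N agents. It is a minimal abstraction of the setup described above, not the paper's implementation; the agent count N, per-chat transmission probability BETA, and round budget ROUNDS are hypothetical parameters chosen for illustration.

```python
# Minimal sketch (not the paper's released code): randomized pair-wise
# chat with memory-borne infection, illustrating why jailbreaking one
# agent can infect (almost) all agents exponentially fast.
import random

N = 1_000_000   # simulated agents (the paper scales to this order)
BETA = 0.9      # hypothetical chance the image spreads during one chat
ROUNDS = 64     # upper bound on chat rounds to simulate

infected = [False] * N
infected[random.randrange(N)] = True  # adversary jailbreaks one agent

for t in range(ROUNDS):
    order = list(range(N))
    random.shuffle(order)             # random pairing for this round
    for i in range(0, N - 1, 2):      # agents order[i] and order[i+1] chat
        a, b = order[i], order[i + 1]
        if infected[a] != infected[b] and random.random() < BETA:
            # the adversarial image is copied into the clean agent's memory
            infected[a] = infected[b] = True
    frac = sum(infected) / N
    print(f"round {t:2d}: infected fraction = {frac:.6f}")
    if frac == 1.0:
        break
```

Because each infected agent has a constant per-round chance of infecting its randomly assigned partner, the infected fraction roughly doubles every round until saturation, which is the exponential behavior the abstract claims.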
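One way to make the "provably restrain" criterion concrete, purely as an illustrative assumption on our part rather than the paper's exact theorem, is an SIS-style epidemic recurrence in which p_t denotes the infected fraction of agents, beta a per-round infection rate, and gamma a per-round recovery (cure) rate:

```latex
% Illustrative SIS-style dynamics (an assumption, not the paper's
% exact derivation): p_t is the infected fraction of agents.
\[
  p_{t+1} \;=\; p_t \;+\; \beta\, p_t\,(1 - p_t) \;-\; \gamma\, p_t .
\]
% Linearizing around p_t \approx 0 gives
% p_{t+1} \approx (1 + \beta - \gamma)\, p_t,
% so in this toy model a defense provably drives the infection
% to zero whenever
\[
  \gamma \;>\; \beta \quad\Longrightarrow\quad p_t \to 0 .
\]
```

Under this reading, a defense restrains the spread when it cures infected memories faster than chats create new infections; any practical mechanism meeting such a rate condition remains, as noted above, an open question.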