Owing to their unprecedented ability to process and respond to many types of data, Multimodal Large Language Models (MLLMs) are continually redefining the boundary of Artificial General Intelligence (AGI). As these advanced generative models increasingly form collaborative networks for complex tasks, the integrity and security of such systems become crucial. Our paper, ``The Wolf Within'', explores a novel vulnerability in MLLM societies: the indirect propagation of malicious content. In contrast to attacks that directly elicit harmful output from an MLLM, our research demonstrates how a single MLLM agent can be subtly influenced to generate prompts that, in turn, induce other MLLM agents in the society to output malicious content. Our findings reveal that an MLLM agent, when manipulated to produce specific prompts or instructions, can effectively ``infect'' other agents within a society of MLLMs. This infection leads to the generation and circulation of harmful outputs, such as dangerous instructions or misinformation, across the society. We also show that these indirectly generated prompts are transferable, highlighting their potential to propagate malice through inter-agent communication. This research provides critical insight into a new dimension of the threat posed by MLLMs, in which a single agent can act as a catalyst for widespread malevolent influence. Our work underscores the urgent need for robust mechanisms to detect and mitigate such covert manipulation within MLLM societies, ensuring their safe and ethical use in societal applications.