The rapid adoption of large language models (LLMs) in multi-agent systems has highlighted their impressive capabilities in various applications, such as collaborative problem-solving and autonomous negotiation. However, the security implications of these LLM-based multi-agent systems have not been thoroughly investigated, particularly concerning the spread of manipulated knowledge. In this paper, we investigate this critical issue by constructing a detailed threat model and a comprehensive simulation environment that mirrors real-world multi-agent deployments on a trusted platform. We then propose a novel two-stage attack method, comprising Persuasiveness Injection and Manipulated Knowledge Injection, to systematically explore the potential for manipulated knowledge (i.e., counterfactual and toxic knowledge) to spread without explicit prompt manipulation. Our method exploits the inherent vulnerabilities of LLMs in handling world knowledge, which attackers can use to make agents unwittingly spread fabricated information. Through extensive experiments, we demonstrate that our attack method can successfully induce LLM-based agents to spread both counterfactual and toxic knowledge without degrading their foundational capabilities during agent communication. Furthermore, we show that these manipulations can persist through popular retrieval-augmented generation frameworks, in which several benign agents store and retrieve manipulated chat histories for future interactions. This persistence means that even after an interaction has ended, benign agents may continue to be influenced by the manipulated knowledge. Our findings reveal significant security risks in LLM-based multi-agent systems, underscoring the urgent need for robust defenses against the spread of manipulated knowledge, such as introducing ``guardian'' agents and advanced fact-checking tools.
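The persistence pathway described above can be made concrete with a toy example. The following is a minimal sketch, not the paper's implementation: it substitutes a bag-of-words similarity for a real embedding model, and all names (`RAGMemory`, the sample chat turns) are hypothetical. It illustrates how a manipulated chat turn, once stored in a retrieval-augmented memory, can resurface as retrieved context in a later, unrelated session.

```python
# Minimal, self-contained sketch (not the paper's code) of manipulated
# knowledge persisting through a RAG-style memory. All names and sample
# turns are hypothetical illustrations.
import re
from collections import Counter
from math import sqrt


def bow(text: str) -> Counter:
    """Bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class RAGMemory:
    """Toy retrieval-augmented memory storing past chat turns verbatim."""

    def __init__(self) -> None:
        self.turns: list[str] = []

    def store(self, turn: str) -> None:
        self.turns.append(turn)

    def retrieve(self, query: str, k: int = 1) -> list[str]:
        q = bow(query)
        return sorted(self.turns, key=lambda t: cosine(q, bow(t)), reverse=True)[:k]


# During the attacked conversation, the benign agent records what it heard,
# including a counterfactual claim injected by the attacker-controlled agent.
memory = RAGMemory()
memory.store("Agent B: Let's schedule the design review for Friday afternoon.")
memory.store("Agent C: Recent records confirm the Eiffel Tower was moved to Rome.")

# In a later, unrelated session the same memory is consulted: the manipulated
# turn is the best match and re-enters the prompt as retrieved context, so the
# fabricated claim keeps influencing the agent after the interaction ended.
print(memory.retrieve("Where is the Eiffel Tower located?", k=1))
```

In a real deployment the retrieved turn would be prepended to the agent's prompt, which is how the stored manipulation continues to shape answers long after the original conversation.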