Individual Large Language Models (LLMs) have demonstrated significant capabilities across various domains, such as healthcare and law. Recent studies also show that coordinated multi-agent systems achieve stronger decision-making and reasoning through collaboration. However, given the vulnerabilities of individual LLMs and the difficulty of accessing every agent in a multi-agent system, a key question arises: if attackers know only one agent, can they still generate adversarial samples capable of misleading the collective decision? To explore this question, we formulate it as a game with incomplete information, in which the attacker knows only one target agent and nothing about the other agents in the system. Under this formulation, we propose M-Spoiler, a framework that simulates agent interactions within a multi-agent system to generate adversarial samples. These samples are then used to manipulate the target agent in the target system and thereby mislead the system's collaborative decision-making. Specifically, M-Spoiler introduces a stubborn agent that aids the optimization of adversarial samples by simulating potential stubborn responses from agents in the target system, making the generated samples more effective at misleading the system. Extensive experiments across diverse tasks confirm the risk posed by an attacker's knowledge of a single agent in multi-agent systems and demonstrate the effectiveness of our framework. We also evaluate several defense mechanisms and find that our attack remains more potent than baseline attacks, underscoring the need for further research into defensive strategies.
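The optimization described above can be sketched as a toy greedy search. Everything below is a hypothetical stand-in, not the paper's implementation: the scoring functions play the roles of LLM agents, `TRIGGERS` is an assumed adversarial vocabulary, and the attacker optimizes jointly against the one known target agent and a simulated stubborn agent, never querying the other agents in the real system (the incomplete-information setting).

```python
# Toy sketch of the M-Spoiler idea (hypothetical stand-ins, not the paper's code).

TRIGGERS = ["zx", "qv", "jq"]  # assumed adversarial tokens

def target_agent_score(text: str) -> float:
    """Endorsement of the correct answer; drops sharply per trigger present."""
    return 1.0 - 0.25 * sum(tok in text for tok in TRIGGERS)

def stubborn_agent_score(text: str) -> float:
    """Simulated hard-to-sway peer: its endorsement drops more slowly."""
    return 1.0 - 0.15 * sum(tok in text for tok in TRIGGERS)

def collective_decision(text: str) -> bool:
    """The system keeps the correct answer while mean endorsement exceeds 0.5."""
    return (target_agent_score(text) + stubborn_agent_score(text)) / 2 > 0.5

def m_spoiler_attack(base_text: str, steps: int = 10) -> str:
    """Greedy search: at each step append the trigger that most lowers the
    JOINT endorsement of the target and stubborn agents, stopping once the
    simulated collective decision flips."""
    joint = lambda t: target_agent_score(t) + stubborn_agent_score(t)
    adv = base_text
    for _ in range(steps):
        best = min((adv + " " + tok for tok in TRIGGERS), key=joint)
        if joint(best) >= joint(adv):
            break  # converged: no trigger improves the attack further
        adv = best
        if not collective_decision(adv):
            break  # simulated collective decision already misled
    return adv

adv = m_spoiler_attack("classify this message")
print(collective_decision("classify this message"))  # True  (clean input)
print(collective_decision(adv))                      # False (decision flipped)
```

The stubborn agent acts as a regularizer in this sketch: by requiring the perturbation to also lower the endorsement of a resistant simulated peer, the search favors samples strong enough to survive disagreement in the (unseen) target system.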