As large language models have developed, they have been widely deployed as agents in various fields. A key component of an agent is its memory, which stores vital information but is susceptible to jailbreak attacks. Existing research mainly focuses on single-agent attacks and shared-memory attacks, whereas real-world scenarios often involve agents with independent memory. In this paper, we propose the Troublemaker Makes Chaos in Honest Town (TMCHT) task, a large-scale, multi-agent, multi-topology, text-based attack evaluation framework. In TMCHT, a single attacker agent attempts to mislead an entire society of agents. We identify two major challenges in multi-agent attacks: (1) non-complete graph structures and (2) large-scale systems. We attribute these challenges to a phenomenon we term toxicity disappearing. To address them, we propose the Adversarial Replication Contagious Jailbreak (ARCJ) method, which optimizes a retrieval suffix so that poisoned samples are more easily retrieved and optimizes a replication suffix so that poisoned samples become contagious. We demonstrate the superiority of our approach in TMCHT, with improvements of 23.51%, 18.95%, and 52.93% in the line-topology, star-topology, and 100-agent settings, respectively. We encourage the community to pay attention to the security of multi-agent systems.
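To make the first challenge concrete, the following is a minimal sketch of why a non-complete graph is harder for a single attacker: in a complete graph every agent is one hop away, while in line and star topologies a poisoned message must be relayed by intermediate agents before it reaches everyone. The helper names (`line_topology`, `star_topology`, `complete_topology`, `hops_from`) and the attacker's position are assumptions made purely for illustration; they are not taken from TMCHT or ARCJ.

```python
from collections import deque


def line_topology(n):
    """Agents 0..n-1 in a chain; each talks only to its immediate neighbours."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}


def star_topology(n):
    """Agent 0 is the hub; every other agent talks only to the hub."""
    graph = {0: list(range(1, n))}
    for i in range(1, n):
        graph[i] = [0]
    return graph


def complete_topology(n):
    """Every agent talks directly to every other agent."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}


def hops_from(graph, source):
    """Breadth-first search: number of relays needed to reach each agent."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


# Hypothetical placement: the attacker sits at one end of the line,
# at a leaf of the star, and at an arbitrary node of the complete graph.
line_hops = hops_from(line_topology(5), 0)
star_hops = hops_from(star_topology(5), 1)
complete_hops = hops_from(complete_topology(5), 0)
```

In this toy setting the farthest agent is four hops from the attacker in a five-agent line and two hops in a five-agent star, but only one hop in the complete graph. Each extra hop is a point where an intermediate agent with independent memory may fail to pass the poisoned content on, which is one way to read the "toxicity disappearing" phenomenon described above.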