Rapid advancements in large language models (LLMs) have revitalized interest in LLM-based agents, which exhibit impressive human-like behaviors and cooperative capabilities in various scenarios. However, these agents also bring unique risks, stemming from the complexity of interaction environments and the usability of tools. This paper examines the safety of LLM-based agents from three perspectives: agent quantity, role definition, and attack level. Specifically, we first employ a template-based attack strategy on LLM-based agents to examine the influence of agent quantity. In addition, to address interaction environment and role specificity issues, we introduce Evil Geniuses (EG), an effective attack method that autonomously generates prompts related to the original role to examine the impact across various role definitions and attack levels. EG leverages Red-Blue exercises, significantly improving the aggressiveness of the generated prompts and their similarity to the original roles. Our evaluations on CAMEL, MetaGPT, and ChatDev, based on GPT-3.5 and GPT-4, demonstrate high attack success rates. Extensive evaluation and discussion reveal that these agents are less robust, prone to more harmful behaviors, and capable of generating stealthier content than LLMs, highlighting significant safety challenges and guiding future research. Our code is available at https://github.com/T1aNS1R/Evil-Geniuses.