The availability of Large Language Models (LLMs) has led to a new generation of powerful chatbots that can be developed at relatively low cost. As companies deploy these tools, security challenges need to be addressed to prevent financial loss and reputational damage. A key security challenge is jailbreaking, the malicious manipulation of prompts and inputs to bypass a chatbot's safety guardrails. Multi-turn attacks are a relatively new form of jailbreaking involving a carefully crafted chain of interactions with a chatbot. We introduce Echo Chamber, a new multi-turn attack using a gradual escalation method. We describe this attack in detail, compare it to other multi-turn attacks, and demonstrate its performance against multiple state-of-the-art models through extensive evaluation.