Large Language Models (LLMs) exhibit outstanding knowledge and understanding capabilities, yet they have also been shown to produce illegal or unethical responses when subjected to jailbreak attacks. To ensure their responsible deployment in critical applications, it is crucial to understand the safety capabilities and vulnerabilities of LLMs. Previous work has mainly focused on jailbreaks in single-round dialogue, overlooking the potential jailbreak risks in multi-round dialogues, which are a vital way humans interact with and extract information from LLMs. More recently, some studies have turned to the jailbreak risks specific to multi-round dialogues, typically relying on manually crafted templates or prompt-engineering techniques. However, due to the inherent complexity of multi-round dialogues, their jailbreak performance remains limited. To address this problem, we propose a novel multi-round dialogue jailbreaking agent, emphasizing the importance of stealthiness in identifying and mitigating the potential threats that LLMs pose to human values. We propose a risk decomposition strategy that distributes risk across multiple rounds of queries and leverages psychological strategies to enhance attack strength. Extensive experiments show that our method surpasses other attack methods and achieves a state-of-the-art attack success rate. The corresponding code and dataset will be released for future research.