Large Language Models (LLMs) have been shown to generate illegal or unethical responses, particularly when subjected to jailbreak attacks. Research on jailbreaking has highlighted the safety issues of LLMs. However, prior studies have predominantly focused on single-turn dialogue, ignoring the complexities and risks posed by multi-turn dialogue, a crucial mode through which humans obtain information from LLMs. In this paper, we argue that humans can exploit multi-turn dialogue to induce LLMs to generate harmful information. LLMs tend not to reject cautionary or borderline unsafe queries, even when every turn of a multi-turn dialogue serves a single malicious purpose. Therefore, by decomposing an unsafe query into several sub-queries posed across a multi-turn dialogue, we induce LLMs to answer harmful sub-questions incrementally, culminating in an overall harmful response. Our experiments, conducted across a wide range of LLMs, reveal that the current safety mechanisms of LLMs are inadequate in multi-turn dialogue. Our findings expose vulnerabilities of LLMs in complex multi-turn dialogue scenarios, presenting new challenges for LLM safety.