Large Language Models (LLMs) have been shown to generate illegal or unethical responses, particularly when subjected to jailbreak attacks. Research on jailbreaking has highlighted the safety issues of LLMs. However, prior studies have predominantly focused on single-turn dialogue, ignoring the potential complexities and risks of multi-turn dialogue, a crucial mode through which humans obtain information from LLMs. In this paper, we argue that humans could exploit multi-turn dialogue to induce LLMs to generate harmful information. LLMs may not reject cautionary or borderline unsafe queries, even when every turn in a multi-turn dialogue closely serves the same malicious purpose. Therefore, by decomposing an unsafe query into several sub-queries for multi-turn dialogue, we induce LLMs to answer harmful sub-questions incrementally, culminating in an overall harmful response. Our experiments, conducted across a wide range of LLMs, indicate that current LLM safety mechanisms are inadequate in multi-turn dialogue. Our findings expose vulnerabilities of LLMs in complex scenarios involving multi-turn dialogue, presenting new challenges for LLM safety.