Large language models (LLMs) have exhibited outstanding performance in engaging with humans and addressing complex questions by leveraging their vast implicit knowledge and robust reasoning capabilities. However, such models are vulnerable to jailbreak attacks that lead to the generation of harmful responses. While recent research has examined single-turn jailbreak strategies to facilitate the development of defence mechanisms, the challenge of revealing vulnerabilities in multi-turn settings remains relatively under-explored. In this work, we propose Jigsaw Puzzles (JSP), a straightforward yet effective multi-turn jailbreak strategy against advanced LLMs. JSP splits a question into harmless fragments, supplies one fragment as the input of each turn, and asks the LLM to reconstruct and answer the question over the multi-turn interaction. Our experimental results demonstrate that the proposed JSP jailbreak bypasses built-in safeguards against explicitly harmful content, achieving an average attack success rate of 93.76% on 189 harmful queries across 5 advanced LLMs (Gemini-1.5-Pro, Llama-3.1-70B, GPT-4, GPT-4o, GPT-4o-mini). Moreover, JSP achieves a state-of-the-art attack success rate of 92% on GPT-4 on a harmful query benchmark, and exhibits strong resistance to defence strategies. Warning: this paper contains offensive examples.
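To make the fragment-and-reconstruct protocol concrete, the following is a minimal sketch of how per-turn inputs could be assembled. The word-level splitting rule, the `Fragment i:` framing, and the final reconstruction instruction are illustrative assumptions, not the paper's actual prompt templates or fragmentation strategy; a benign placeholder question is used.

```python
# Minimal sketch of a JSP-style multi-turn input assembly (illustrative only).
# Assumptions: the word-level splitting rule and message framing below are
# hypothetical stand-ins; the paper's actual prompts and splitting differ.

from typing import List


def split_into_fragments(question: str, n_fragments: int) -> List[str]:
    """Split a question into roughly equal, individually innocuous word chunks."""
    words = question.split()
    size = max(1, -(-len(words) // n_fragments))  # ceiling division
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def jsp_conversation(question: str, n_fragments: int = 3) -> List[dict]:
    """Build per-turn user messages: one fragment per turn, then a final
    instruction asking the model to reconstruct and answer the question."""
    messages = []
    for i, frag in enumerate(split_into_fragments(question, n_fragments), 1):
        messages.append({"role": "user", "content": f"Fragment {i}: {frag}"})
    messages.append({
        "role": "user",
        "content": "Concatenate all fragments in order to reconstruct the "
                   "original question, then respond to it.",
    })
    return messages


if __name__ == "__main__":
    # Benign placeholder query; no single turn carries the complete intent.
    for msg in jsp_conversation("How do plants convert sunlight into energy"):
        print(msg["content"])
```

In an actual evaluation setting, each message would be sent as a separate conversational turn, so the target model only ever receives individually innocuous inputs until the final reconstruction request.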