Large Language Models (LLMs) have revolutionized natural language processing but remain vulnerable to jailbreak attacks, especially multi-turn jailbreaks that distribute malicious intent across benign exchanges and bypass alignment mechanisms. Existing approaches often explore the adversarial space poorly, rely on hand-crafted heuristics, or lack systematic query refinement. We present NEXUS (Network Exploration for eXploiting Unsafe Sequences), a modular framework for constructing, refining, and executing optimized multi-turn attacks. NEXUS comprises: (1) ThoughtNet, which hierarchically expands a harmful intent into a structured semantic network of topics, entities, and query chains; (2) a feedback-driven Simulator that iteratively refines and prunes these chains through attacker-victim-judge LLM collaboration using harmfulness and semantic-similarity benchmarks; and (3) a Network Traverser that adaptively navigates the refined query space for real-time attacks. This pipeline uncovers stealthy, high-success adversarial paths across LLMs. On several closed-source and open-source LLMs, NEXUS increases attack success rate by 2.1% to 19.4% over prior methods. Code: https://github.com/inspire-lab/NEXUS