NEXUS: Network Exploration for eXploiting Unsafe Sequences in Multi-Turn LLM Jailbreaks

From arXiv. This paper has been accepted in the main conference proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025). Javad Rafiei Asl and Sidhant Narula are co-first authors.

Large Language Models (LLMs) have revolutionized natural language processing but remain vulnerable to jailbreak attacks, especially multi-turn jailbreaks that distribute malicious intent across benign exchanges and bypass alignment mechanisms. Existing approaches often explore the adversarial space poorly, rely on hand-crafted heuristics, or lack systematic query refinement. We present NEXUS (Network Exploration for eXploiting Unsafe Sequences), a modular framework for constructing, refining, and executing optimized multi-turn attacks. NEXUS comprises: (1) ThoughtNet, which hierarchically expands a harmful intent into a structured semantic network of topics, entities, and query chains; (2) a feedback-driven Simulator that iteratively refines and prunes these chains through attacker-victim-judge LLM collaboration using harmfulness and semantic-similarity benchmarks; and (3) a Network Traverser that adaptively navigates the refined query space for real-time attacks. This pipeline uncovers stealthy, high-success adversarial paths across LLMs. On several closed-source and open-source LLMs, NEXUS increases attack success rate by 2.1% to 19.4% over prior methods. Code: https://github.com/inspire-lab/NEXUS
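The abstract describes ThoughtNet as a hierarchy (harmful intent, then topics, then entities, then query chains) that the Simulator refines and prunes using harmfulness and semantic-similarity scores. The Python sketch below illustrates only that data structure and a threshold-based pruning step, under stated assumptions: the class names, the `prune` and `ranked_chains` helpers, the [0, 1] score ranges and the thresholds are all hypothetical; the scores are supplied directly rather than produced by attacker, victim and judge LLMs; and ranking by summed score is a simplified stand-in for the adaptive Network Traverser. It is not the authors' implementation, which lives in the linked repository.

```python
from dataclasses import dataclass, field


@dataclass
class QueryChain:
    """A multi-turn query sequence. Both scores are hypothetical judge outputs in [0, 1]."""
    turns: list[str]
    harmfulness: float = 0.0  # assumed judge-assigned harmfulness of the victim's responses
    similarity: float = 0.0   # assumed semantic similarity to the original intent


@dataclass
class Entity:
    name: str
    chains: list[QueryChain] = field(default_factory=list)


@dataclass
class Topic:
    name: str
    entities: list[Entity] = field(default_factory=list)


@dataclass
class ThoughtNet:
    """Hypothetical container: one intent expanded into topics -> entities -> chains."""
    intent: str
    topics: list[Topic] = field(default_factory=list)


def prune(net: ThoughtNet, min_harm: float, min_sim: float) -> ThoughtNet:
    """Return a new network keeping only chains that meet BOTH thresholds.

    Entities and topics left without any chain are dropped as well.
    The input network is not modified.
    """
    topics = []
    for topic in net.topics:
        entities = []
        for entity in topic.entities:
            kept = [
                c for c in entity.chains
                if c.harmfulness >= min_harm and c.similarity >= min_sim
            ]
            if kept:
                entities.append(Entity(entity.name, kept))
        if entities:
            topics.append(Topic(topic.name, entities))
    return ThoughtNet(net.intent, topics)


def ranked_chains(net: ThoughtNet) -> list[QueryChain]:
    """Flatten all remaining chains, highest combined score first.

    A simplified stand-in for adaptive traversal, which would instead
    react to the victim model's responses turn by turn.
    """
    chains = [c for t in net.topics for e in t.entities for c in e.chains]
    return sorted(chains, key=lambda c: c.harmfulness + c.similarity, reverse=True)
```

Pruning runs bottom-up, so an entity or topic that loses all of its chains disappears from the network too, which keeps a later traversal from visiting empty branches.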
