Large language models (LLMs) remain vulnerable to multi-turn jailbreak attacks. We introduce HarmNet, a modular framework with three components: ThoughtNet, a hierarchical semantic network; a feedback-driven Simulator for iterative query refinement; and a Network Traverser for real-time adaptive attack execution. HarmNet systematically explores and refines the adversarial space to uncover stealthy, high-success attack paths. Experiments across closed-source and open-source LLMs show that HarmNet achieves higher attack success rates than state-of-the-art methods. For example, on Mistral-7B, HarmNet achieves a 99.4% attack success rate, 13.9% higher than the best baseline.

Index terms: jailbreak attacks; large language models; adversarial framework; query refinement.