Jailbreak attacks expose persistent safety weaknesses in large language models (LLMs), but existing stateless single-turn methods face a trade-off: hand-crafted prompts are expressive but static, while iterative prompt optimization can adapt but often relies on low-level mutations that require many target queries. We propose JailbreakOPT, a tool-assisted framework for improving iterative single-turn jailbreak prompt optimization. JailbreakOPT organizes diverse atomic jailbreak prompts into an attack tool library and composes them through a unified intra-episode optimization abstraction to generate stronger standalone attack prompts. To reuse experience across attack episodes, JailbreakOPT further frames tool selection as a contextual bandit problem and applies contextual Thompson sampling to guide exploration and exploitation based on past outcomes. Experiments across multiple target LLMs and attack goals show that JailbreakOPT improves attack success rate (ASR) while reducing the number of attacks until success (No.A) compared with atomic single-turn attacks and existing iterative optimization baselines. This paper may contain offensive or harmful content.
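To make the bandit framing of tool selection concrete, the following is a minimal sketch of contextual Thompson sampling, not the paper's implementation. The abstract does not specify the context features, the reward definition, or the posterior model, so everything here is an assumption: contexts are treated as discrete labels, each tool is an abstract arm index, the reward is a binary success signal, and each (context, arm) pair gets an independent Beta-Bernoulli posterior. The names `ContextualThompsonSampler` and `simulate` are hypothetical, and the environment is synthetic.

```python
import random
from collections import defaultdict


class ContextualThompsonSampler:
    """Tabular contextual Thompson sampling with Beta-Bernoulli posteriors.

    Assumption: contexts are hashable discrete labels and rewards are binary.
    One Beta(alpha, beta) posterior is kept per (context, arm) pair.
    """

    def __init__(self, n_arms, seed=None):
        self.n_arms = n_arms
        self.rng = random.Random(seed)
        # Each entry is [alpha, beta]; [1, 1] is a uniform Beta(1, 1) prior.
        self.params = defaultdict(lambda: [[1.0, 1.0] for _ in range(n_arms)])

    def select(self, context):
        # Draw one plausible success rate per arm, then act greedily on the draws.
        samples = [self.rng.betavariate(a, b) for a, b in self.params[context]]
        return max(range(self.n_arms), key=samples.__getitem__)

    def update(self, context, arm, success):
        # Conjugate update: success increments alpha, failure increments beta.
        self.params[context][arm][0 if success else 1] += 1.0

    def posterior_mean(self, context, arm):
        a, b = self.params[context][arm]
        return a / (a + b)


def simulate(true_rates, n_rounds=2000, seed=0):
    """Run the sampler on a synthetic environment.

    true_rates[context][arm] is the (hypothetical) success probability of an arm
    in a context. Returns the trained sampler and per-context pull counts.
    """
    rng = random.Random(seed)
    contexts = sorted(true_rates)
    n_arms = len(true_rates[contexts[0]])
    sampler = ContextualThompsonSampler(n_arms, seed=seed + 1)
    pulls = {c: [0] * n_arms for c in contexts}
    for _ in range(n_rounds):
        c = rng.choice(contexts)
        arm = sampler.select(c)
        success = rng.random() < true_rates[c][arm]
        sampler.update(c, arm, success)
        pulls[c][arm] += 1
    return sampler, pulls


if __name__ == "__main__":
    # Synthetic example: the best arm differs between the two contexts.
    rates = {"ctx_a": [0.9, 0.1, 0.1], "ctx_b": [0.1, 0.1, 0.9]}
    _, pulls = simulate(rates)
    print(pulls)
```

Sampling from the posterior rather than acting on its mean is what balances exploration and exploitation: arms with little data have wide posteriors and are occasionally drawn high, while well-tested arms are chosen in proportion to how likely they are to be best. A continuous-feature variant (for example, a linear-Gaussian posterior) would replace the per-context table, but which variant the paper uses is not stated in the abstract.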