Large Language Models (LLMs) have become integral to many domains, making their safety a critical priority. Prior jailbreaking research has explored diverse approaches, including prompt optimization, automated red teaming, obfuscation, and reinforcement learning (RL)-based methods. However, most existing techniques fail to effectively leverage vulnerabilities revealed in earlier interaction turns, resulting in inefficient and unstable attacks. Since jailbreaking involves sequential interactions in which each response influences subsequent actions, RL provides a natural framework for the problem. Motivated by this, we propose a history-aware RL-based jailbreak framework that analyzes and reweights vulnerability signals from prior steps to guide future decisions. We show that incorporating historical information alone improves jailbreak success rates. Building on this insight, we introduce an attention-based reweighting mechanism that highlights critical vulnerabilities within the interaction history, enabling more efficient exploration with fewer queries. Extensive experiments on AdvBench and HarmBench demonstrate that our method achieves state-of-the-art jailbreak performance while significantly improving query efficiency. These results underscore the importance of historical vulnerability signals in RL-driven jailbreak strategies and offer a principled pathway for advancing adversarial research on LLM safeguards.
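To make the attention-based reweighting concrete, the following is a minimal, hypothetical sketch (an illustration under our own assumptions, not the paper's implementation): per-turn vulnerability signals are embedded as vectors, scored against the current attack state via scaled dot-product attention, and combined into a context vector that can condition the RL policy. The names `HistoryReweighter`, `state`, and `history` are assumed, and the embedding pipeline that produces these vectors is left abstract.

```python
import torch
import torch.nn as nn

class HistoryReweighter(nn.Module):
    """Hypothetical sketch: attend over per-turn vulnerability embeddings
    to produce a context vector for the jailbreak policy."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)  # projects the current attack state
        self.k_proj = nn.Linear(dim, dim)  # projects historical vulnerability signals
        self.v_proj = nn.Linear(dim, dim)

    def forward(self, state: torch.Tensor, history: torch.Tensor) -> torch.Tensor:
        # state:   (dim,)   embedding of the current attack state
        # history: (T, dim) embeddings of vulnerability signals from T prior turns
        q = self.q_proj(state)                        # (dim,)
        k = self.k_proj(history)                      # (T, dim)
        v = self.v_proj(history)                      # (T, dim)
        scores = k @ q / history.shape[-1] ** 0.5     # (T,) scaled dot-product scores
        weights = torch.softmax(scores, dim=0)        # reweighted vulnerability signals
        return weights @ v                            # (dim,) context conditioning the policy

# Toy usage with assumed dimensions: 5 prior turns, 64-dim embeddings.
reweighter = HistoryReweighter(dim=64)
context = reweighter(torch.randn(64), torch.randn(5, 64))
```

In this reading, turns whose responses exposed stronger vulnerabilities receive larger attention weights, so the policy's next query is steered toward the most promising attack directions rather than exploring uniformly.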