Large language models (LLMs) are increasingly equipped with safety alignment mechanisms, yet recent studies show that they remain vulnerable to jailbreaking attacks that elicit harmful behaviors without explicit policy violations. Although a growing body of work has explored automated jailbreak strategies, existing methods face two fundamental challenges: they do not systematically exploit both successful and failed attack experiences, and they lack principled mechanisms for composing and selecting reusable attack rules under diverse constraints. As a result, they struggle to accumulate transferable knowledge over time and to adapt attack strategies reliably across different targets and evolving safety mechanisms. To address these issues, we propose a Self-Evolving Rule-Driven Training-Free Jailbreak (SRTJ) framework that systematically discovers, composes, and refines attack strategies through interaction and feedback, without updating model parameters. Specifically, SRTJ couples experience-driven attack generation with answer set programming (ASP)-based rule selection and constraint-aware composition, using iterative verifier feedback both to refine successful strategies and to analyze failure patterns. The resulting rule memory evolves hierarchically, explicitly organizing distilled attack knowledge into long-term, mid-term, and short-term rules. This organization captures both stable, transferable strategies and transient, adaptive behaviors, balancing exploration and exploitation across attack attempts. Extensive experiments on a mainstream jailbreak benchmark (HarmBench) demonstrate that SRTJ achieves strong and stable attack performance across different target LLMs, with improved robustness and generalization compared to existing jailbreak methods. The code is available at https://github.com/TheSolkatt/SRTJ.