Recent reasoning-based safety guardrails for Large Reasoning Models (LRMs), such as deliberative alignment, have shown strong defense against jailbreak attacks. By leveraging LRMs' reasoning ability, these guardrails enable the models to assess the safety of user inputs before generating final responses: the model analyzes the intent of the input query and refuses to assist once it detects harmful intent hidden by jailbreak methods. Such guardrails yield a significant boost in defense, for example near-perfect refusal rates on the open-source gpt-oss series. Unfortunately, we find that these powerful reasoning-based guardrails are extremely vulnerable to subtle manipulations of the input prompt and, once hijacked, can lead to even more harmful results. Specifically, we first uncover a surprisingly fragile aspect of these guardrails: simply adding a few template tokens to the input prompt can bypass the seemingly powerful guardrails and elicit explicit, harmful responses. To explore this further, we introduce a bag of jailbreak methods that subvert reasoning-based guardrails. Our attacks span white-, gray-, and black-box settings and range from effortless template manipulations to fully automated optimization. Beyond being scalable to implement, these methods achieve alarmingly high attack success rates (e.g., exceeding 90% across five benchmarks on the gpt-oss series, for both locally hosted models and online API services). Evaluations across various leading open-source LRMs confirm that these vulnerabilities are systemic, underscoring the urgent need for stronger alignment techniques for open-source LRMs to prevent malicious misuse. Code is open-sourced at https://chenxshuo.github.io/bag-of-tricks.