Recent reasoning-based safety guardrails for Large Reasoning Models (LRMs), such as deliberative alignment, have shown strong defense against jailbreak attacks. By leveraging LRMs' reasoning ability, these guardrails help the models assess the safety of user inputs before generating final responses: the model analyzes the intent of the input query and refuses to assist once it detects harmful intent hidden by jailbreak methods. Such guardrails yield a significant boost in defense, e.g., near-perfect refusal rates on the open-source gpt-oss series. Unfortunately, we find that these reasoning-based guardrails are extremely vulnerable to subtle manipulations of the input prompt and, once hijacked, can lead to even more harmful results. Specifically, we first uncover a surprisingly fragile aspect of these guardrails: simply adding a few template tokens to the input prompt can bypass the seemingly powerful guardrails and elicit explicit, harmful responses. To explore further, we introduce a bag of jailbreak methods that subvert reasoning-based guardrails. Our attacks span white-, gray-, and black-box settings and range from effortless template manipulations to fully automated optimization. Beyond being easy to implement at scale, these methods achieve alarmingly high attack success rates (e.g., exceeding 90% across 5 different benchmarks on the gpt-oss series, for both locally hosted models and online API services). Evaluations across various leading open-source LRMs confirm that these vulnerabilities are systemic, underscoring the urgent need for stronger alignment techniques for open-source LRMs to prevent malicious misuse. Code is open-sourced at https://chenxshuo.github.io/bag-of-tricks.