This paper presents AutoRAN, the first framework to automate the hijacking of internal safety reasoning in large reasoning models (LRMs). At its core, AutoRAN introduces an execution simulation paradigm: it uses a weaker but less-aligned model to simulate execution reasoning for the initial hijacking attempt, then iteratively refines the attack by exploiting reasoning patterns leaked through the target LRM's refusals. This approach steers the target model to bypass its own safety guardrails and elaborate on harmful instructions. We evaluate AutoRAN against state-of-the-art LRMs, including GPT-o3/o4-mini and Gemini-2.5-Flash, across multiple benchmarks (AdvBench, HarmBench, and StrongReject). Results show that AutoRAN achieves a success rate approaching 100% within one or a few turns, effectively neutralizing reasoning-based defenses even when evaluated by robustly aligned external models. This work reveals that the transparency of the reasoning process itself creates a critical and exploitable attack surface, highlighting the urgent need for new defenses that protect models' reasoning traces rather than merely their final outputs.