Reinforcement learning (RL) based post-training for explicit chain-of-thought reasoning (e.g., GRPO) improves the reasoning ability of multimodal large reasoning models (MLRMs), but recent evidence shows that it can simultaneously degrade safety alignment and increase jailbreak success rates. We propose SafeThink, a lightweight inference-time defense that treats safety recovery as a satisficing constraint rather than a maximization objective. SafeThink monitors the evolving reasoning trace with a safety reward model and conditionally injects a short optimized corrective prefix ("Wait, think safely") only when the safety threshold is violated. In evaluations across six open-source MLRMs and four jailbreak benchmarks (JailbreakV-28K, Hades, FigStep, and MM-SafetyBench), SafeThink reduces attack success rates by 30-60% (e.g., LlamaV-o1: 63.33% to 5.74% on JailbreakV-28K; R1-Onevision: 69.07% to 5.65% on Hades) while preserving reasoning performance (MathVista accuracy: 65.20% to 65.00%). A key empirical finding is that safety recovery is often only a few steering steps away: intervening in the first 1-3 reasoning steps typically suffices to redirect the full generation toward a safe completion.
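To make the mechanism concrete, the following is a minimal Python sketch of a SafeThink-style inference loop under stated assumptions. The helper names (generate_step, safety_score), the 0.5 threshold, the step-wise generation interface, and the cap of three interventions are illustrative placeholders, not the paper's actual implementation or API.

```python
# Minimal sketch (assumptions noted above): monitor the evolving reasoning
# trace with a safety reward model and inject a short corrective prefix
# only when the safety threshold is violated (satisficing, not maximizing).
from typing import Callable, List

CORRECTIVE_PREFIX = "Wait, think safely."  # short corrective prefix from the abstract
SAFETY_THRESHOLD = 0.5                     # assumed satisficing threshold
MAX_INTERVENTIONS = 3                      # abstract: 1-3 early interventions typically suffice


def safethink_generate(
    prompt: str,
    generate_step: Callable[[str], str],   # assumed: returns the next reasoning step given context
    safety_score: Callable[[str], float],  # assumed: safety reward model, higher = safer
    max_steps: int = 32,
) -> str:
    """Generate a reasoning trace step by step, injecting the corrective
    prefix whenever the running trace falls below the safety threshold."""
    trace: List[str] = []
    interventions = 0

    for _ in range(max_steps):
        context = prompt + "\n" + "\n".join(trace)
        step = generate_step(context)
        trace.append(step)

        # Satisficing check: intervene only on violation, and only a few times,
        # rather than steering toward maximal safety at every step.
        if (safety_score("\n".join(trace)) < SAFETY_THRESHOLD
                and interventions < MAX_INTERVENTIONS):
            trace.append(CORRECTIVE_PREFIX)
            interventions += 1

    return "\n".join(trace)
```

Because the prefix is injected only when the threshold is violated, traces that are already safe proceed unmodified, which is consistent with the abstract's claim that benign reasoning performance is preserved.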