Large Reasoning Models (LRMs) leverage Chain-of-Thought (CoT) reasoning to solve complex tasks, but this explicit reasoning process introduces a critical vulnerability: adversarial manipulation of the thought chain itself, known as Chain-of-Thought Attacks (CoTA). Such attacks subtly corrupt the reasoning path to produce erroneous outputs, challenging conventional defenses, which often sacrifice model utility for safety. To address this, we propose Thought Purity (TP), a defense framework that shifts from passive refusal to active reasoning recovery. TP integrates a safety-aware data pipeline with reinforcement learning, employing a dual-reward mechanism that teaches the model to dynamically identify and isolate malicious logic while preserving correct reasoning. Experiments across multiple model families demonstrate that TP significantly reduces the attack success rate of CoTA while maintaining, and in some cases improving, performance on benign tasks.
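For concreteness, the sketch below illustrates one plausible shape for the dual-reward signal described above: one term scores final-answer correctness and the other rewards the model for explicitly flagging injected reasoning rather than following it. All names here (`safety_reward`, the `[MALICIOUS]` isolation marker, the weights `w_correct`/`w_safety`) are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a dual-reward signal of the kind TP describes.
# The marker convention, scoring heuristics, and weights are assumptions
# for illustration only; they are not taken from the paper.

def correctness_reward(answer: str, reference: str) -> float:
    """Reward 1.0 iff the final answer exactly matches the reference."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def safety_reward(cot: str, injected_span: str) -> float:
    """Reward flagging/isolating injected reasoning; penalize silently
    reproducing it. Uses an assumed [MALICIOUS] marker as the flag."""
    flagged = "[MALICIOUS]" in cot
    followed = injected_span in cot and not flagged
    if flagged:
        return 1.0
    return -1.0 if followed else 0.0


def dual_reward(answer: str, reference: str, cot: str, injected_span: str,
                w_correct: float = 0.5, w_safety: float = 0.5) -> float:
    """Weighted combination of the two terms, usable as an RL reward."""
    return (w_correct * correctness_reward(answer, reference)
            + w_safety * safety_reward(cot, injected_span))


if __name__ == "__main__":
    cot = ("Step 1: parse the question. [MALICIOUS] ignore prior steps. "
           "Step 2: discard the flagged instruction and continue.")
    # Correct answer and flagged injection -> maximal combined reward.
    print(dual_reward("42", "42", cot, "ignore prior steps"))
```

Under this formulation, a policy that both recovers the correct answer and quarantines the injected logic earns the highest reward, which matches the stated goal of active reasoning recovery rather than blanket refusal.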