Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models

Large language models (LLMs) increasingly rely on explicit chain-of-thought (CoT) reasoning to solve complex tasks, yet the safety of the reasoning process itself remains largely unaddressed. Existing work on LLM safety focuses on content safety (detecting harmful, biased, or factually incorrect outputs) and treats the reasoning chain as an opaque intermediate artifact. We identify reasoning safety as an orthogonal and equally critical security dimension: the requirement that a model's reasoning trajectory be logically consistent, computationally efficient, and resistant to adversarial manipulation. We make three contributions. First, we formally define reasoning safety and introduce a nine-category taxonomy of unsafe reasoning behaviors, covering input parsing errors, reasoning execution errors, and process management errors. Second, we conduct a large-scale prevalence study annotating 4,111 reasoning chains drawn from natural reasoning benchmarks and from four adversarial attack methods (reasoning hijacking and denial-of-service), confirming that all nine error types occur in practice and that each attack induces a mechanistically interpretable signature. Third, we propose a Reasoning Safety Monitor: an external LLM-based component that runs in parallel with the target model, inspects each reasoning step in real time via a taxonomy-embedded prompt, and dispatches an interrupt signal upon detecting unsafe behavior. Evaluation on a 450-chain static benchmark shows that our monitor achieves up to 84.88% step-level localization accuracy and 85.37% error-type classification accuracy, outperforming hallucination detectors and process reward model baselines by substantial margins. These results demonstrate that reasoning-level monitoring is both necessary and practically achievable, and establish reasoning safety as a foundational concern for the secure deployment of large reasoning models.
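To make the monitoring loop concrete, the following is a minimal sketch of step-level inspection with an interrupt, not the authors' implementation. It makes several assumptions: the names `monitor`, `Interrupt`, `Judge`, and `toy_judge` are hypothetical; the LLM call with the taxonomy-embedded prompt is replaced by a trivial stand-in that flags verbatim repeated steps; only the three top-level error groups named above are used, since the nine fine-grained categories are not listed here; and the loop is sequential for simplicity, whereas the monitor described above runs in parallel with the target model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

# The three top-level groups named in the abstract. The nine
# fine-grained categories are not listed there, so they are omitted.
ERROR_GROUPS = (
    "input_parsing",
    "reasoning_execution",
    "process_management",
)


@dataclass
class Interrupt:
    step_index: int   # 0-based index of the first step flagged as unsafe
    error_group: str  # one of ERROR_GROUPS


# A judge maps (previous steps, current step) to an error group or None.
# In a real system this would wrap an LLM call; here it is an assumption.
Judge = Callable[[list[str], str], Optional[str]]


def monitor(steps: Iterable[str], judge: Judge) -> Optional[Interrupt]:
    """Inspect steps as they arrive and stop at the first one judged unsafe."""
    history: list[str] = []
    for index, step in enumerate(steps):
        verdict = judge(history, step)
        if verdict is not None:
            if verdict not in ERROR_GROUPS:
                raise ValueError(f"unknown error group: {verdict!r}")
            # Returning here stops consuming further steps,
            # which plays the role of the interrupt signal.
            return Interrupt(step_index=index, error_group=verdict)
        history.append(step)
    return None


def toy_judge(history: list[str], step: str) -> Optional[str]:
    """Hypothetical stand-in for the LLM judge: flags a verbatim repeated step."""
    if step in history:
        return "process_management"
    return None


if __name__ == "__main__":
    chain = ["Let x = 2.", "Then 2x = 4.", "Then 2x = 4.", "So the answer is 4."]
    print(monitor(chain, toy_judge))
```

Because `monitor` accepts any iterable, it can be fed a generator that yields steps as the target model produces them, so the early return stops inspection at the flagged step instead of waiting for the full chain.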
