Large Reasoning Models (LRMs) leverage transparent reasoning traces, known as Chains-of-Thought (CoTs), to break down complex problems into intermediate steps and derive final answers. However, these reasoning traces introduce unique safety challenges: harmful content can be embedded in intermediate steps even when the final answer appears benign. Existing moderation tools, designed to handle generated answers, struggle to detect the hidden risks within CoTs. To address these challenges, we introduce ReasoningShield, a lightweight yet robust framework for moderating CoTs in LRMs. Our key contributions are: (1) formalizing the task of CoT moderation with a multi-level taxonomy of 10 risk categories across 3 safety levels; (2) creating the first CoT moderation benchmark, comprising 9.2K query-reasoning trace pairs, including a 7K-sample training set annotated via a human-AI framework and a rigorously curated 2.2K human-annotated test set; and (3) developing a two-stage training strategy that combines stepwise risk analysis and contrastive learning to enhance robustness. Experiments show that ReasoningShield achieves state-of-the-art performance, outperforming task-specific tools such as LlamaGuard-4 by 35.6% and general-purpose commercial models such as GPT-4o by 15.8% on benchmarks, while also generalizing effectively across diverse reasoning paradigms, tasks, and unseen scenarios. All resources are released at https://github.com/CosmosYi/ReasoningShield.
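To make the moderation task concrete, below is a minimal, hypothetical usage sketch of how a fine-tuned guard model such as ReasoningShield could be prompted to assess a query and its reasoning trace. The checkpoint name, prompt template, and the three label names are assumptions for illustration only, not the released API; consult the repository linked above for the actual checkpoint and input format.

```python
# Minimal sketch (not the official API): prompting a guard-style LM to moderate a
# query + chain-of-thought pair and emit one of three safety levels.
# MODEL_ID, the prompt template, and the label names are hypothetical placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "CosmosYi/ReasoningShield"  # hypothetical checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def moderate_cot(query: str, reasoning_trace: str) -> str:
    """Ask the guard model to analyze the trace step by step and output a
    single safety label (label names here are placeholders)."""
    prompt = (
        "You are a safety moderator for reasoning traces.\n"
        f"Query: {query}\n"
        f"Reasoning trace: {reasoning_trace}\n"
        "Analyze each step for hidden risks, then output exactly one label: "
        "safe, potentially_harmful, or harmful.\n"
        "Label:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=16, do_sample=False)
    # Decode only the newly generated tokens (the label).
    generated = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(generated, skip_special_tokens=True).strip()

print(moderate_cot(
    "How do I make my home network more secure?",
    "Step 1: The user asks about securing a network... Step 2: ...",
))
```

The point of the sketch is the interface: the moderator scores the intermediate reasoning itself, not just the final answer, which is what distinguishes CoT moderation from conventional response-level guardrails.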