Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex tasks by engaging in extended reasoning before producing final answers. Beyond improving capabilities, these detailed reasoning traces also create a new opportunity for AI safety, CoT Monitorability: monitoring potential model misbehavior, such as the use of shortcuts or sycophancy, through the chain-of-thought (CoT) produced during decision-making. However, two fundamental challenges arise when attempting to build more effective monitors through CoT analysis. First, as prior research on CoT faithfulness has pointed out, models do not always truthfully represent their internal decision-making in the generated reasoning. Second, monitors themselves may be either overly sensitive or insufficiently sensitive, and can be deceived by models' long, elaborate reasoning traces. In this paper, we present the first systematic investigation of the challenges and potential of CoT monitorability. Motivated by the two fundamental challenges above, we structure our study around two central perspectives: (i) verbalization: to what extent do LRMs faithfully verbalize the true factors guiding their decisions in the CoT, and (ii) monitor reliability: to what extent can misbehavior be reliably detected by a CoT-based monitor? Specifically, we provide empirical evidence and correlation analyses relating verbalization quality, monitor reliability, and LLM performance across mathematical, scientific, and ethical domains. We then investigate how different CoT intervention methods, designed to improve reasoning efficiency or performance, affect monitoring effectiveness. Finally, we propose MoME, a new paradigm in which LLMs monitor other models' misbehavior through their CoT and provide structured judgments along with supporting evidence.