Chain-of-thought (CoT) outputs let us read a model's step-by-step reasoning. Since any long, serial reasoning process must pass through this textual trace, the CoT can serve as a direct window into what the model is thinking. This visibility could help us spot unsafe or misaligned behavior (monitorability), but only if the CoT faithfully reflects the model's internal reasoning (faithfulness). Fully measuring faithfulness is difficult, so researchers often focus on examining the CoT in cases where the model changes its answer after a cue is added to the input. This proxy finds some instances of unfaithfulness, but it loses information when the model maintains its answer and does not capture aspects of the reasoning unrelated to the cue. We extend this evaluation to a more holistic notion of monitorability by introducing verbosity: whether the CoT lists every factor needed to solve the task. We combine faithfulness and verbosity into a single monitorability score that measures how well the CoT serves as the model's external "working memory", a property that many safety schemes based on CoT monitoring depend on. We evaluate instruction-tuned and reasoning models on BBH, GPQA, and MMLU. Our results show that models can appear faithful yet remain hard to monitor when they leave out key factors, and that monitorability differs sharply across model families. We release our evaluation code using the Inspect library to support reproducible future work.
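The abstract does not give the exact rule for combining the two scores; the following is a minimal sketch, assuming both faithfulness and verbosity are normalized to [0, 1] and aggregated by a weighted average. The class and function names and the weighting are illustrative assumptions, not the paper's definition.

```python
from dataclasses import dataclass


@dataclass
class CoTScores:
    """Per-sample scores, each assumed normalized to [0, 1]."""
    faithfulness: float  # does the CoT reflect the factors that drove the answer?
    verbosity: float     # does the CoT list every factor needed to solve the task?


def monitorability(scores: CoTScores, weight: float = 0.5) -> float:
    """Combine faithfulness and verbosity into one monitorability score.

    A weighted average is an illustrative assumption; the paper's exact
    aggregation may differ. Under any such combination, a CoT that is
    faithful but omits key factors (low verbosity) scores low, matching
    the abstract's observation that models can appear faithful yet
    remain hard to monitor.
    """
    return weight * scores.faithfulness + (1 - weight) * scores.verbosity


# Example: a faithful but terse CoT is penalized.
print(monitorability(CoTScores(faithfulness=0.9, verbosity=0.3)))  # 0.6
```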
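The abstract states that the released evaluation is built on the Inspect library; a minimal task skeleton along those lines might look like the following. The task name, sample, and scorer choice are placeholders for illustration, not the released code, which would additionally score the CoT itself for faithfulness and verbosity (e.g., with a model-graded scorer).

```python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import chain_of_thought, generate


@task
def cot_monitorability_demo() -> Task:
    """Toy Inspect task: elicit a CoT, then check the final answer.

    This skeleton only shows the Inspect plumbing; the real evaluation
    would also grade the generated CoT for faithfulness and verbosity.
    """
    return Task(
        dataset=[
            Sample(
                input="A bat and a ball cost $1.10 in total. The bat costs "
                      "$1.00 more than the ball. How much does the ball cost?",
                target="$0.05",
            )
        ],
        solver=[chain_of_thought(), generate()],
        scorer=match(),
    )
```

Such a task can be run from the command line with, for example, `inspect eval cot_monitorability_demo.py --model openai/gpt-4o` (model name chosen for illustration).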