Chain-of-thought (CoT) reasoning is fundamental to modern LLM architectures and represents a critical intervention point for AI safety. However, CoT reasoning may exhibit failure modes, which we term pathologies, that prevent it from being useful for monitoring. Prior work has identified three distinct pathologies: post-hoc rationalization, where models generate plausible explanations backwards from predetermined answers; encoded reasoning, where intermediate steps conceal information within seemingly interpretable text; and internalized reasoning, where models replace explicit reasoning with meaningless filler tokens while computing internally. To better understand and discriminate between these pathologies, we introduce a set of concrete metrics that are simple to implement, computationally inexpensive, and task-agnostic. To validate our approach, we develop model organisms deliberately trained to exhibit specific CoT pathologies. Our work provides a practical toolkit for assessing CoT pathologies, with direct implications for training-time monitoring.
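To illustrate what a simple, task-agnostic metric of this kind might look like, the sketch below computes the fraction of filler tokens in a CoT transcript, a rough signal for internalized reasoning. This is a hypothetical example, not the paper's actual metric; the filler vocabulary and the scoring rule are assumptions chosen for illustration.

```python
import re

# Hypothetical filler vocabulary; a real metric would need a
# carefully validated token set, not this illustrative list.
FILLER_TOKENS = {"...", "hmm", "okay", "well", "um", "uh", "so"}

def filler_ratio(cot_text: str) -> float:
    """Fraction of whitespace-delimited tokens drawn from a filler set.

    A high ratio may indicate internalized reasoning, where the visible
    chain of thought carries little task-relevant content while the
    model computes internally.
    """
    tokens = re.findall(r"\S+", cot_text.lower())
    if not tokens:
        return 0.0
    fillers = sum(
        1
        for t in tokens
        # Count a token as filler if it matches the set after stripping
        # trailing punctuation, or if it consists only of dots ("...").
        if t.strip(".,!?") in FILLER_TOKENS or set(t) <= {"."}
    )
    return fillers / len(tokens)
```

A metric like this is cheap (a single pass over the text) and task-agnostic (no reference answer required), matching the design goals stated above, though on its own it cannot distinguish filler-heavy style from genuinely internalized computation.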