Language models often solve complex tasks by generating long reasoning chains composed of many steps of varying importance. While some steps are crucial for generating the final answer, others can be removed. Determining which steps matter most, and why, remains an open question central to understanding how models process reasoning. We investigate whether this question is best approached through model internals or through the tokens of the reasoning chain itself. We find that model activations contain more information than tokens for identifying important reasoning steps. Crucially, by training probes on model activations to predict importance, we show that models encode an internal representation of step importance, even before subsequent steps are generated. The internal representations of different models agree closely on which steps are important. The representation is distributed across layers and does not correlate with surface-level features such as a step's relative position or its length. Our findings suggest that analyzing activations can reveal aspects of reasoning that surface-level approaches fundamentally miss, indicating that reasoning analyses should examine model internals.
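As a rough illustration of what training a probe on activations to predict step importance can look like, the sketch below fits a logistic-regression probe with plain gradient descent. It is a minimal example under stated assumptions, not the procedure used in this work: the activations are synthetic random vectors, importance is simplified to a binary label, and the names `train_linear_probe` and `probe_accuracy` are hypothetical.

```python
import numpy as np


def train_linear_probe(acts, labels, lr=0.1, epochs=500):
    """Fit a logistic-regression probe on per-step activation vectors.

    acts:   (n_steps, hidden_dim) array, one vector per reasoning step.
    labels: (n_steps,) array of 0/1 importance labels (a simplification).
    """
    n, d = acts.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        logits = acts @ w + b
        probs = 1.0 / (1.0 + np.exp(-logits))
        err = probs - labels              # per-example gradient of log loss w.r.t. logits
        w -= lr * (acts.T @ err) / n      # average gradient over the batch
        b -= lr * err.mean()
    return w, b


def probe_accuracy(w, b, acts, labels):
    """Fraction of steps whose predicted label matches the given label."""
    preds = (acts @ w + b) > 0
    return float((preds == labels.astype(bool)).mean())


# Synthetic stand-in data (assumption): importance is a noisy linear
# function of the activation vector. Real activations would be read
# from a model's hidden states at each reasoning step.
rng = np.random.default_rng(0)
hidden_dim = 32
direction = rng.normal(size=hidden_dim)
acts = rng.normal(size=(1000, hidden_dim))
noise = 0.5 * rng.normal(size=1000)
labels = (acts @ direction + noise > 0).astype(float)

# Train on the first 800 steps, evaluate on the remaining 200.
w, b = train_linear_probe(acts[:800], labels[:800])
test_acc = probe_accuracy(w, b, acts[800:], labels[800:])
print(f"held-out probe accuracy: {test_acc:.2f}")
```

Held-out accuracy well above chance would indicate that the label is linearly decodable from the vectors. Running the same probe on token-level features of each step would give the corresponding surface-level baseline.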