Current frontier large language models rely on reasoning to achieve state-of-the-art performance. Many existing interpretability methods are of limited use in this area, as standard techniques were designed to study single forward passes of a model rather than the multi-token computational steps that unfold during reasoning. We argue that analyzing reasoning traces at the sentence level is a promising approach to understanding reasoning processes. We introduce a black-box method that measures each sentence's counterfactual importance by repeatedly sampling replacement sentences from the model, filtering for semantically different ones, and continuing the chain of thought from that point onwards to quantify the sentence's impact on the distribution of final answers. We discover that certain sentences can have an outsized impact on the trajectory of the reasoning trace and the final answer; we term these sentences \textit{thought anchors}. They are generally planning or uncertainty-management sentences, and specialized attention heads consistently attend from subsequent sentences to thought anchors. We further show that examining sentence-to-sentence causal links within a reasoning trace gives insight into a model's behavior. Such information can be used to predict a problem's difficulty and the extent to which different question domains involve sequential or diffuse reasoning. As a proof of concept, we demonstrate that our techniques together provide a practical toolkit for analyzing reasoning models: in a detailed case study of how a model solves a difficult math problem, they yield a consistent picture of the reasoning trace's structure. We provide an open-source tool (thought-anchors.com) for visualizing the outputs of our methods on further problems. The convergence across our methods shows the potential of sentence-level analysis for a deeper understanding of reasoning models.
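The core of the resampling method above is comparing the distribution of final answers with and without a given sentence. The sketch below illustrates one way that comparison might be scored, using total-variation distance between empirical answer distributions; the metric choice, function names, and the stubbed-out sampling and semantic filtering are our assumptions, not the paper's exact implementation.

```python
from collections import Counter

def answer_distribution(answers):
    """Turn a list of sampled final answers into an empirical
    probability distribution over distinct answers."""
    counts = Counter(answers)
    total = len(answers)
    return {a: c / total for a, c in counts.items()}

def counterfactual_importance(base_answers, resampled_answers):
    """Score a sentence's importance as the total-variation distance
    between final-answer distributions from rollouts that keep the
    sentence (base) and rollouts that replace it with a semantically
    different resampled sentence. Ranges from 0 (no effect) to 1."""
    p = answer_distribution(base_answers)
    q = answer_distribution(resampled_answers)
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in support)

# Toy example: replacing a "thought anchor" flips many rollouts
# from the answer "42" to "7", yielding a high importance score.
base = ["42"] * 8 + ["7"] * 2          # rollouts continuing the original CoT
resampled = ["42"] * 3 + ["7"] * 7     # rollouts after sentence replacement
score = counterfactual_importance(base, resampled)
```

In a real pipeline, `base` and `resampled` would each come from many model rollouts: for each sentence position, one would sample candidate replacement sentences from the model, keep only those judged semantically different (e.g. by an embedding-similarity threshold), continue generation from that point, and extract the final answers.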