Thought Branches: Interpreting LLM Reasoning Requires Resampling

Most work interpreting reasoning models studies only a single chain-of-thought (CoT), yet these models define distributions over many possible CoTs. We argue that studying a single sample is inadequate for understanding causal influence and the underlying computation. Though fully specifying this distribution is intractable, we can measure a partial CoT's impact by resampling only the subsequent text. We present case studies using resampling to investigate model decisions. First, when a model states a reason for its action, does that reason actually cause the action? In "agentic misalignment" scenarios, we find that self-preservation sentences have small causal impact, suggesting they do not meaningfully drive blackmail. Second, are artificial edits to CoT sufficient for steering reasoning? Resampling and selecting a completion with the desired property is a principled on-policy alternative. We find that off-policy interventions yield small and unstable effects compared to resampling in decision-making tasks. Third, how do we understand the effect of removing a reasoning step when the model may repeat it post-edit? We introduce a resilience metric that repeatedly resamples to prevent similar content from reappearing downstream. Critical planning statements resist removal but have large effects when eliminated. Fourth, since CoT is sometimes "unfaithful", can our methods teach us anything in these settings? Adapting causal mediation analysis, we find that hints that causally affect the output without being explicitly mentioned exert a subtle and cumulative influence on the CoT that persists even if the hint is removed. Overall, studying distributions via resampling enables reliable causal analysis, clearer narratives of model reasoning, and principled CoT interventions.
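The core measurement described above, estimating a partial CoT's impact by resampling only what follows it, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the authors' implementation: `toy_sampler` is a hypothetical stand-in for a real model, the function names are invented, and total variation distance is just one convenient way to compare the two outcome distributions.

```python
import random
from collections import Counter
from typing import Callable, Dict, List

# A sampler takes a CoT prefix (list of sentences) and returns one final answer.
Sampler = Callable[[List[str], random.Random], str]


def outcome_distribution(prefix: List[str], sampler: Sampler,
                         n: int, rng: random.Random) -> Dict[str, float]:
    """Resample n continuations from the prefix and tally the final answers."""
    counts = Counter(sampler(prefix, rng) for _ in range(n))
    return {answer: count / n for answer, count in counts.items()}


def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def resampling_impact(cot: List[str], index: int, sampler: Sampler,
                      n: int = 2000, seed: int = 0) -> float:
    """Compare outcomes when resampling after sentence `index` is kept
    versus when resampling from just before it."""
    rng = random.Random(seed)
    with_sentence = outcome_distribution(cot[: index + 1], sampler, n, rng)
    without_sentence = outcome_distribution(cot[:index], sampler, n, rng)
    return total_variation(with_sentence, without_sentence)


def toy_sampler(prefix: List[str], rng: random.Random) -> str:
    """Hypothetical model: answer "A" is more likely once a plan is stated."""
    p_a = 0.9 if any("plan" in sentence for sentence in prefix) else 0.5
    return "A" if rng.random() < p_a else "B"


cot = [
    "Let me restate the problem.",
    "My plan is to pick option A.",
    "So the answer follows.",
]
impacts = [resampling_impact(cot, i, toy_sampler) for i in range(len(cot))]
```

In this toy setup only the planning sentence shifts the answer distribution, so its estimated impact is large while the other two sentences score near zero. With a real model, `sampler` would instead generate a continuation from the prefix and extract the final decision.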
