Current Chain-of-Thought (CoT) verification methods predict reasoning correctness from a model's outputs (black-box) or activations (gray-box), but offer limited insight into why a computation fails. We introduce a white-box method: Circuit-based Reasoning Verification (CRV). We hypothesize that attribution graphs of correct CoT steps, viewed as execution traces of the model's latent reasoning circuits, possess structural fingerprints distinct from those of incorrect steps. By training a classifier on structural features of these graphs, we show that these traces contain a strong signal of reasoning errors. Our white-box approach yields scientific insights unattainable by other methods. (1) We demonstrate that structural signatures of error are highly predictive, establishing the viability of verifying reasoning directly via its computational graph. (2) We find these signatures to be highly domain-specific, revealing that failures in different reasoning tasks manifest as distinct computational patterns. (3) We provide evidence that these signatures are not merely correlational: by using our analysis to guide targeted interventions on individual transcoder features, we successfully correct the model's faulty reasoning. Our work shows that, by scrutinizing a model's computational process, we can move from simple error detection to a deeper, causal understanding of LLM reasoning.
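To make the idea of classifying attribution graphs by their structure concrete, the following is a minimal sketch, not the method used in this work. It assumes an attribution graph can be represented as a dictionary of weighted directed edges, and everything in it is hypothetical: the four features, the nearest-centroid classifier, the function names (`structural_features`, `fit_centroids`, `predict`), and the hand-made toy graphs. It uses only the Python standard library.

```python
from math import dist
from statistics import fmean


def structural_features(edges):
    """Map a weighted directed graph {(src, dst): weight} to a feature vector.

    The feature set is an illustrative assumption, not the one from the paper.
    """
    nodes = {node for edge in edges for node in edge}
    n_nodes, n_edges = len(nodes), len(edges)
    # Density of a directed graph without self-loops: E / (N * (N - 1)).
    density = n_edges / (n_nodes * (n_nodes - 1)) if n_nodes > 1 else 0.0
    mean_abs_weight = fmean(abs(w) for w in edges.values()) if edges else 0.0
    in_degree = {}
    for _, dst in edges:
        in_degree[dst] = in_degree.get(dst, 0) + 1
    max_in_degree = max(in_degree.values(), default=0)
    return [float(n_nodes), density, mean_abs_weight, float(max_in_degree)]


def fit_centroids(graphs, labels):
    """Average the feature vectors of each class (nearest-centroid training)."""
    by_label = {}
    for graph, label in zip(graphs, labels):
        by_label.setdefault(label, []).append(structural_features(graph))
    return {
        label: [fmean(column) for column in zip(*rows)]
        for label, rows in by_label.items()
    }


def predict(centroids, graph):
    """Return the label whose centroid is closest in Euclidean distance."""
    features = structural_features(graph)
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Toy, hand-made graphs standing in for per-step attribution graphs.
train_graphs = [
    {("x", "h1"): 0.9, ("x", "h2"): 0.8, ("h1", "y"): 0.7, ("h2", "y"): 0.9},
    {("x", "h1"): 0.8, ("x", "h2"): 0.7, ("h1", "y"): 0.9, ("h2", "y"): 0.6},
    {("x", "h1"): 0.2, ("h1", "y"): 0.1},
    {("x", "h1"): 0.1, ("h1", "y"): 0.3},
]
train_labels = ["correct", "correct", "incorrect", "incorrect"]

centroids = fit_centroids(train_graphs, train_labels)

dense_step = {("x", "h1"): 0.85, ("x", "h2"): 0.75, ("h1", "y"): 0.8, ("h2", "y"): 0.7}
sparse_step = {("x", "h1"): 0.15, ("h1", "y"): 0.2}
print(predict(centroids, dense_step), predict(centroids, sparse_step))
```

In this toy setup the two classes differ in node count, mean edge weight, and maximum in-degree, so a nearest-centroid rule is enough to separate them. Real attribution graphs over transcoder features would be far larger and would likely need richer features and a stronger classifier.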