Large language models (LLMs) demonstrate strong reasoning abilities in solving complex real-world problems. Yet the internal mechanisms driving these complex reasoning behaviors remain opaque. Existing interpretability approaches targeting reasoning either identify components (e.g., neurons) correlated with specific textual patterns, or rely on human-annotated contrastive pairs to derive control vectors. Consequently, current methods struggle to precisely localize complex reasoning mechanisms or to capture the sequential influence of a model's internal workings on its reasoning outputs. In this paper, guided by outcome-oriented and sequential-influence-aware principles, we focus on identifying components that contribute sequentially to reasoning behavior, where outcomes accumulate through long-range effects. We propose Integrated Policy Gradient (IPG), a novel framework that attributes reasoning behaviors to a model's internal components by propagating compound outcome-based signals, such as post-reasoning accuracy, backward through model inference trajectories. Empirical evaluations demonstrate that our approach achieves more precise localization and enables reliable modulation of reasoning behaviors (e.g., reasoning capability, reasoning strength) across diverse reasoning models.
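To make the attribution idea concrete, the following is a minimal toy sketch of outcome-based, trajectory-level credit assignment in the spirit of a policy-gradient estimator. It is illustrative only and not the paper's IPG algorithm: the tiny softmax "model", the per-component gates `g`, and the parity-based reward are all invented for the example. The sketch shows how a single outcome signal, assigned only after the full trajectory, can be distributed backward over per-component log-probability gradients accumulated across steps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a model: next-token logits are a linear map of
# per-component activations (NOT an actual LLM; illustration only).
n_components, vocab, T = 4, 5, 12
W = rng.normal(size=(vocab, n_components))   # component -> logit map

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

g = np.ones(n_components)          # gates: 1.0 = component fully active
score = np.zeros(n_components)     # accumulated grad log-prob per component
tokens = []
for _ in range(T):
    raw = rng.normal(size=n_components)      # stand-in component activations
    p = softmax(W @ (raw * g))               # gated forward pass
    a = rng.choice(vocab, p=p)               # sample the next token
    tokens.append(a)
    # Chain rule through the gate:
    # d log p[a] / d g = (W[a] - E_p[W]) * raw
    score += (W[a] - p @ W) * raw

# Compound outcome signal available only after the whole trajectory
# (e.g., +1 if the final answer is correct); a placeholder here.
reward = 1.0 if tokens[-1] % 2 == 0 else -1.0
attribution = reward * score       # REINFORCE-style credit per component
print(attribution.round(3))
```

Components with large positive attribution are those whose activations consistently raised the log-probability of the sampled trajectory when the outcome was rewarded, which is the intuition behind localizing components by their sequential, long-range contribution to the final outcome rather than by correlation with local textual patterns.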