Large Language Models (LLMs) have achieved strong complex-reasoning capabilities through Chain-of-Thought (CoT) reasoning. However, their reasoning patterns remain too complicated to analyze. While Sparse Autoencoders (SAEs) have emerged as a powerful interpretability tool, existing approaches operate predominantly at the token level, creating a granularity mismatch that obscures more critical step-level information, such as reasoning direction and semantic transitions. In this work, we propose the Step-level Sparse Autoencoder (SSAE), an analytical tool that disentangles different aspects of LLMs' reasoning steps into sparse features. Specifically, by precisely controlling the sparsity of a step feature conditioned on its context, we form an information bottleneck in step reconstruction that separates incremental information from background information and disentangles it into a small number of sparsely activated dimensions. Experiments on multiple base models and reasoning tasks demonstrate the effectiveness of the extracted features. With linear probing, they readily predict surface-level information, such as generation length and first-token distribution, as well as more complicated properties, such as the correctness and logicality of a step. These observations indicate that LLMs already encode these properties, at least in part, during generation, which provides a foundation for their self-verification ability. The code is available at https://github.com/Miaow-Lab/SSAE.
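To make the information-bottleneck idea concrete, the following is a minimal sketch, assuming a TopK-style sparsity constraint and a linear context pathway; the class name `StepSAE`, the dimensions, and the MSE reconstruction loss are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: a step-level SAE that reconstructs a step's
# representation from its context plus a sparse code, so the sparse code
# must carry the *incremental* information the context cannot supply.
import torch
import torch.nn as nn


class StepSAE(nn.Module):
    def __init__(self, d_model: int, d_dict: int, k: int):
        super().__init__()
        self.k = k  # number of active dictionary features per step (the bottleneck)
        self.encoder = nn.Linear(2 * d_model, d_dict)    # (step, context) -> features
        self.decoder = nn.Linear(d_dict, d_model, bias=False)
        self.context_proj = nn.Linear(d_model, d_model)  # background pathway

    def forward(self, step: torch.Tensor, context: torch.Tensor):
        # Encode the step conditioned on its context.
        z = self.encoder(torch.cat([step, context], dim=-1))
        # TopK bottleneck: keep only the k largest activations.
        topk = torch.topk(z, self.k, dim=-1)
        z_sparse = torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)
        # Reconstruct the step from background (context) + sparse increments.
        recon = self.context_proj(context) + self.decoder(z_sparse)
        return recon, z_sparse


# Toy usage with random step/context representations (placeholder sizes).
d_model, d_dict, k = 768, 4096, 32
sae = StepSAE(d_model, d_dict, k)
step = torch.randn(8, d_model)     # pooled hidden state of one reasoning step
context = torch.randn(8, d_model)  # pooled hidden state of the preceding context
recon, codes = sae(step, context)
loss = nn.functional.mse_loss(recon, step)
```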
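The linear-probing evaluation can likewise be sketched in a few lines: fit a linear classifier on frozen step features to predict a step property (here, step correctness). The random features and labels below are placeholders; in the paper's setting they would come from the trained SSAE and annotated reasoning traces.

```python
# Hedged sketch of a linear probe over step-level sparse features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 4096))  # sparse step codes (placeholder)
labels = rng.integers(0, 2, size=1000)    # 1 = step judged correct (placeholder)

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"probe accuracy: {probe.score(X_test, y_test):.3f}")
```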