Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process

Despite the growing reasoning capabilities of recent large language models (LLMs), their internal mechanisms during the reasoning process remain underexplored. Prior approaches often rely on human-defined concepts (e.g., overthinking, reflection) at the word level to analyze reasoning in a supervised manner. However, such methods are limited, as it is infeasible to capture the full spectrum of potential reasoning behaviors, many of which are difficult to define in token space. In this work, we propose an unsupervised framework (namely, RISE: Reasoning behavior Interpretability via Sparse auto-Encoder) for discovering reasoning vectors, which we define as directions in the activation space that encode distinct reasoning behaviors. By segmenting chain-of-thought traces into sentence-level 'steps' and training sparse auto-encoders (SAEs) on step-level activations, we uncover disentangled features corresponding to interpretable behaviors such as reflection and backtracking. Visualization and clustering analyses show that these behaviors occupy separable regions in the decoder column space. Moreover, targeted interventions on SAE-derived vectors can controllably amplify or suppress specific reasoning behaviors, altering inference trajectories without retraining. Beyond behavior-specific disentanglement, SAEs capture structural properties such as response length, revealing clusters of long versus short reasoning traces. More interestingly, SAEs enable the discovery of novel behaviors beyond human supervision. We demonstrate the ability to control response confidence by identifying confidence-related vectors in the SAE decoder space. These findings underscore the potential of unsupervised latent discovery for both interpreting and controllably steering reasoning in LLMs.
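The pipeline described above (sentence-level segmentation, an SAE over step-level activations, and intervention along a decoder column) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `split_into_steps`, `TinySAE`, and `steer` are hypothetical, the regex segmentation rule and the ReLU/L1 SAE form are assumptions, the weights are random and untrained, and the step activations are random stand-ins for what would normally be hidden states read from an LLM.

```python
import re
import numpy as np


def split_into_steps(trace: str) -> list[str]:
    """Split a chain-of-thought trace into sentence-level steps.

    Assumption: a naive rule that splits after '.', '!' or '?' followed by
    whitespace. A real pipeline would likely need a more robust splitter.
    """
    parts = re.split(r"(?<=[.!?])\s+", trace.strip())
    return [p for p in parts if p]


class TinySAE:
    """Toy sparse auto-encoder: ReLU encoder, linear decoder.

    Decoder columns are unit-normalized so each column can be read as a
    candidate direction ("reasoning vector") in activation space.
    Weights are random here; training is omitted.
    """

    def __init__(self, d_model: int, n_features: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.normal(size=(n_features, d_model)) / np.sqrt(d_model)
        self.b_enc = np.zeros(n_features)
        W_dec = rng.normal(size=(d_model, n_features))
        self.W_dec = W_dec / np.linalg.norm(W_dec, axis=0, keepdims=True)
        self.b_dec = np.zeros(d_model)

    def encode(self, h: np.ndarray) -> np.ndarray:
        # Non-negative feature activations, one row per step.
        return np.maximum(0.0, (h - self.b_dec) @ self.W_enc.T + self.b_enc)

    def decode(self, z: np.ndarray) -> np.ndarray:
        return z @ self.W_dec.T + self.b_dec

    def loss(self, h: np.ndarray, l1_coeff: float = 1e-3) -> float:
        # Reconstruction error plus an L1 sparsity penalty on the features.
        z = self.encode(h)
        mse = np.mean(np.sum((h - self.decode(z)) ** 2, axis=-1))
        l1 = np.mean(np.sum(np.abs(z), axis=-1))
        return float(mse + l1_coeff * l1)


def steer(h: np.ndarray, sae: TinySAE, feature: int, alpha: float) -> np.ndarray:
    """Shift activations along one decoder column.

    alpha > 0 amplifies the direction, alpha < 0 suppresses it.
    """
    return h + alpha * sae.W_dec[:, feature]


trace = (
    "Let x = 3. Then x + 2 = 5. "
    "Wait, let me double-check that. Yes, the answer is 5."
)
steps = split_into_steps(trace)

# Stand-in for step-level activations (one vector per step); in practice
# these would come from a chosen layer of the model.
d_model, n_features = 16, 64
rng = np.random.default_rng(1)
H = rng.normal(size=(len(steps), d_model))

sae = TinySAE(d_model, n_features)
H_steered = steer(H, sae, feature=7, alpha=2.0)
```

Because the decoder columns have unit norm, steering with strength `alpha` changes each step's projection onto the chosen direction by exactly `alpha`, which gives a simple sanity check on the intervention before it is applied inside a real forward pass.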
