We investigate whether sparse autoencoders (SAEs) identify genuine reasoning features in large language models (LLMs). We first show through a simple theoretical analysis that $\ell_1$-regularized SAEs are intrinsically biased toward low-dimensional patterns, providing a mechanistic explanation for why shallow linguistic cues may be preferentially captured over distributed reasoning behaviors. Motivated by this bias, we introduce a falsification-oriented evaluation framework that combines causal token injection and LLM-guided falsification to test whether feature activation reflects reasoning processes or superficial linguistic correlates. Across 20 configurations spanning multiple model families, layers, and reasoning datasets, we find that features identified by contrastive methods are highly sensitive to token-level interventions, with 45% to 90% activating when a small number of associated tokens are injected into non-reasoning text. For the remaining features, LLM-guided falsification consistently produces non-reasoning inputs that activate the feature and reasoning inputs that do not, with no analyzed feature satisfying our criteria for genuine reasoning behavior. Steering these features yields no improvements in benchmark performance. Overall, our results suggest that SAE features identified by current contrastive approaches primarily capture linguistic correlates of reasoning rather than the underlying reasoning computations themselves. Code is available at https://github.com/GeorgeMLP/reasoning-probing.
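The causal token-injection test described above can be illustrated with a minimal sketch. Everything here is hypothetical: the "SAE feature" is a toy ReLU readout over bag-of-words counts whose vocabulary, weights, and threshold were chosen for illustration, standing in for a real feature's activation on model residuals. The criterion itself follows the abstract: a feature is flagged as token-sensitive if it is quiet on non-reasoning text but fires once a few of its associated tokens are injected.

```python
import numpy as np

# Toy stand-in for an SAE feature: a ReLU readout over bag-of-words counts.
# Vocabulary, weights, bias, and threshold are all hypothetical, chosen so
# that "reasoning-flavored" connective tokens dominate the activation.
VOCAB = ["therefore", "because", "hence", "cat", "ran", "sat", "the"]
W = np.array([2.0, 1.5, 1.8, 0.0, 0.0, 0.0, 0.1])
BIAS = -1.0

def feature_activation(tokens):
    """ReLU(w . counts + b): activation of the toy feature on a token list."""
    counts = np.array([tokens.count(v) for v in VOCAB], dtype=float)
    return float(max(W @ counts + BIAS, 0.0))

def token_injection_test(nonreasoning_tokens, associated_tokens, threshold=0.5):
    """Return True if the feature is 'token-sensitive': quiet on the
    non-reasoning text, but firing once a few associated tokens are injected."""
    base = feature_activation(nonreasoning_tokens)
    injected = feature_activation(nonreasoning_tokens + associated_tokens)
    return base < threshold and injected >= threshold

text = "the cat sat the cat ran".split()
print(token_injection_test(text, ["therefore", "hence"]))  # prints True
```

A feature that passes this test activates for superficial lexical reasons rather than for the reasoning content of the input, which is the falsification signal the paper reports for 45% to 90% of contrastively identified features.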