We investigate whether sparse autoencoders (SAEs) identify genuine reasoning features in large language models (LLMs). We first show through a simple theoretical analysis that $\ell_1$-regularized SAEs are intrinsically biased toward low-dimensional patterns, providing a mechanistic explanation for why shallow linguistic cues may be preferentially captured over distributed reasoning behaviors. Motivated by this bias, we introduce a falsification-oriented evaluation framework that combines causal token injection and LLM-guided falsification to test whether feature activation reflects reasoning processes or superficial linguistic correlates. Across 20 configurations spanning multiple model families, layers, and reasoning datasets, we find that features identified by contrastive methods are highly sensitive to token-level interventions, with 45% to 90% activating when a small number of associated tokens are injected into non-reasoning text. For the remaining features, LLM-guided falsification consistently produces non-reasoning inputs that activate the feature and reasoning inputs that do not, with no analyzed feature satisfying our criteria for genuine reasoning behavior. Steering these features yields no improvements in benchmark performance. Overall, our results suggest that SAE features identified by current contrastive approaches primarily capture linguistic correlates of reasoning rather than the underlying reasoning computations themselves. Code is available at https://github.com/GeorgeMLP/reasoning-probing.
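The causal token-injection test can be sketched as follows. All names here are hypothetical: `feature_activation` is a toy stand-in (a keyword-frequency detector) for reading an SAE feature's activation from model hidden states, used only so the example is self-contained.

```python
# Sketch of the causal token-injection test (hypothetical names; a real
# implementation would read the SAE feature's activation from model states).

def inject_tokens(text: str, tokens: list[str]) -> str:
    """Append a small number of feature-associated tokens to non-reasoning text."""
    return text + " " + " ".join(tokens)

def feature_activation(text: str, associated_tokens: list[str]) -> float:
    """Toy stand-in: activation proportional to associated-token frequency."""
    words = text.lower().split()
    return sum(words.count(t.lower()) for t in associated_tokens) / max(len(words), 1)

def is_token_sensitive(non_reasoning_text: str, associated_tokens: list[str],
                       threshold: float = 0.05) -> bool:
    """Flag a feature as a shallow token detector if injecting its associated
    tokens into non-reasoning text raises activation past the threshold."""
    base = feature_activation(non_reasoning_text, associated_tokens)
    injected_text = inject_tokens(non_reasoning_text, associated_tokens)
    return feature_activation(injected_text, associated_tokens) - base > threshold

text = "The weather today is sunny with a light breeze."
print(is_token_sensitive(text, ["therefore", "hence", "thus"]))  # → True
```

A feature that passes this test on most non-reasoning inputs is activating on surface tokens rather than on reasoning content; in the paper's experiments, 45% to 90% of contrastively identified features fail in exactly this way.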