We investigate whether sparse autoencoders (SAEs) identify genuine reasoning features in large language models (LLMs). Starting from features selected with standard contrastive activation methods, we introduce a falsification-oriented framework that combines causal token-injection experiments with LLM-guided falsification to test whether feature activation reflects reasoning processes or merely superficial linguistic correlates. Across 20 configurations spanning multiple model families, layers, and reasoning datasets, we find that the identified reasoning features are highly sensitive to token-level interventions. Injecting a small number of feature-associated tokens into non-reasoning text is sufficient to elicit strong activation for 59% to 94% of features, indicating reliance on lexical artifacts. For the remaining features, those not explained by simple token triggers, LLM-guided falsification consistently produces non-reasoning inputs that activate the feature and reasoning inputs that do not, with no analyzed feature satisfying our criteria for genuine reasoning behavior. Steering these features yields minimal changes or slight degradations in benchmark performance. Together, these results suggest that SAE features identified by contrastive approaches primarily capture linguistic correlates of reasoning rather than the underlying reasoning computations themselves.
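To make the token-injection test concrete, the sketch below illustrates the core comparison under stated assumptions. Everything here is a hypothetical stand-in rather than the paper's actual code: `sae_feature_activation` is a toy proxy for hooking the model's residual stream and applying the SAE encoder, and the trigger words, example texts, and 0.5 activation threshold are illustrative values, not the paper's.

```python
# Minimal sketch of the token-injection falsification test.
# All names are illustrative stand-ins, not the paper's implementation:
# a real `sae_feature_activation` would run the LLM, capture activations
# at the chosen layer, and apply the SAE encoder for one feature.

from typing import List


def sae_feature_activation(text: str, feature_id: int) -> float:
    """Hypothetical helper: max activation of `feature_id` over tokens.

    A toy lexical proxy is used so the sketch runs standalone; a real
    implementation would hook the model and SAE instead.
    """
    trigger_words = {"therefore", "thus", "hence"}  # toy lexical triggers
    return 1.0 if any(w in text.lower() for w in trigger_words) else 0.0


def is_lexically_triggered(
    feature_id: int,
    non_reasoning_text: str,
    feature_tokens: List[str],
    threshold: float = 0.5,  # illustrative activation threshold
) -> bool:
    """Flag a feature if injecting its top associated tokens into
    non-reasoning text is enough to make it fire strongly."""
    baseline = sae_feature_activation(non_reasoning_text, feature_id)
    injected = non_reasoning_text + " " + " ".join(feature_tokens)
    perturbed = sae_feature_activation(injected, feature_id)
    # Lexically triggered: quiet on the clean text, active after injection.
    return baseline < threshold <= perturbed


# Usage: a "reasoning" feature that fires on a recipe once reasoning
# connectives are appended is explained by lexical triggers, which
# falsifies a reasoning interpretation of that feature.
recipe = "Mix the flour and water, then bake for twenty minutes."
print(is_lexically_triggered(0, recipe, ["therefore", "thus"]))  # True
```

In this framing, a feature passes the test only if no such token-level perturbation elicits activation, which is the condition most features in the study fail.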