Sparse Autoencoders (SAEs) have emerged as a promising tool for interpreting neural networks by decomposing their activations into sparse sets of human-interpretable features. Recent work has introduced multiple SAE variants and successfully scaled them to frontier models. Despite much excitement, a growing number of negative results in downstream tasks casts doubt on whether SAEs recover meaningful features. To directly investigate this, we perform two complementary evaluations. On a synthetic setup with known ground-truth features, we demonstrate that SAEs recover only $9\%$ of true features despite achieving $71\%$ explained variance, showing that they fail at their core task even when reconstruction is strong. To evaluate SAEs on real activations, we introduce three baselines that constrain SAE feature directions or their activation patterns to random values. Through extensive experiments across multiple SAE architectures, we show that our baselines match fully-trained SAEs in interpretability (0.87 vs 0.90), sparse probing (0.69 vs 0.72), and causal editing (0.73 vs 0.72). Together, these results suggest that SAEs in their current state do not reliably decompose models' internal mechanisms.
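To make the "random directions" baseline concrete, here is a minimal sketch (in NumPy, with hypothetical parameter names, not the paper's implementation) of an SAE whose decoder feature directions are frozen to random unit vectors; in the paper's setup only the remaining parameters would be trained, while the feature dictionary itself stays random.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 64, 256

# Frozen random decoder: each feature direction is a fixed random unit vector.
# This mirrors the baseline idea of constraining feature directions to random
# values; names (W_dec, W_enc, b_enc) are illustrative assumptions.
W_dec = rng.standard_normal((n_features, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)

# In the baseline, only encoder parameters would be trained. Here the encoder
# is simply tied to the decoder transpose for illustration.
W_enc = W_dec.T.copy()
b_enc = np.zeros(n_features)

def encode(x):
    # Standard ReLU SAE code: sparse, non-negative feature activations.
    return np.maximum(x @ W_enc + b_enc, 0.0)

def decode(f):
    # Reconstruction is a non-negative combination of random directions.
    return f @ W_dec

x = rng.standard_normal((8, d_model))   # a batch of model activations
f = encode(x)
x_hat = decode(f)
print(f.shape, x_hat.shape)
```

The point of such a baseline is that any interpretability, probing, or editing score it achieves cannot be attributed to learned feature directions, since those were never trained.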