Decompose Sparsely Where You Should, Absorb Densely Where You Should Not

Sparse autoencoders (SAEs) are typically trained to reconstruct the \textbf{entire} residual stream through a sparse dictionary, which implicitly assumes that all activation content is amenable to sparse, monosemantic decomposition. We question this assumption. We hypothesize that activations contain a low-rank, dense component that is computationally important to the model yet inherently unsuitable for sparse representation, and that this component is a major source of the persistent dense latents widely observed in trained SAEs. To test this, we add a small rank-$r$ linear bottleneck in parallel with standard SAEs (BatchTopK and Matryoshka), allowing dense structure to be absorbed before sparse reconstruction. On Gemma-2-2B layer 12, a rank-24 bottleneck reduces the dense latent count by up to 84\% while improving sparse probing and targeted probe perturbation on both architectures at matched sparsity. The absorbed component is (i) \textbf{structurally identifiable} as the top principal components and outlier dimensions; (ii) \textbf{causally necessary}: removing it raises next-token cross-entropy by 7.5$\times$, far exceeding the 2.8$\times$ increase from removing the geometrically near-identical top-24 PCA directions; and (iii) \textbf{redundantly encoded by sparse dictionaries}: ablating 787 maximally aligned sparse features raises cross-entropy by only 2.9$\times$, and ablating 2,048 topic-aligned features leaves MMLU topic classification virtually unchanged, whereas removing the absorbed component drops accuracy from 98.7\% to chance. Together, our findings identify a compact, semantically informative, and causally important component of residual stream activations (which we term a \textbf{computational scaffold}) that standard sparse dictionaries represent inefficiently, suggesting that the scope of sparsity-based interpretability methods warrants careful re-examination.
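The parallel arrangement described above can be illustrated with a minimal NumPy sketch of a forward pass. It is not the authors' implementation. It assumes that both paths read the same input and that their outputs are summed, and it uses a plain per-sample TopK in place of BatchTopK or Matryoshka. The names `init_params`, `topk_sparse_codes`, `forward`, `U`, `V`, `W_enc`, and `W_dec` are hypothetical, and the weights are random rather than trained.

```python
import numpy as np


def init_params(d_model, n_latents, rank, seed=0):
    """Random (untrained) weights for the sketch; all names are hypothetical."""
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(d_model)
    return {
        # Dense path: d_model -> rank -> d_model (the rank-r bottleneck).
        "U": rng.normal(0.0, scale, (d_model, rank)),
        "V": rng.normal(0.0, scale, (rank, d_model)),
        # Sparse path: a standard overcomplete dictionary.
        "W_enc": rng.normal(0.0, scale, (d_model, n_latents)),
        "b_enc": np.zeros(n_latents),
        "W_dec": rng.normal(0.0, scale, (n_latents, d_model)),
        "b_dec": np.zeros(d_model),
    }


def topk_sparse_codes(pre_acts, k):
    """Apply ReLU, then keep only the k largest activations in each row."""
    acts = np.maximum(pre_acts, 0.0)
    # Column indices of the k largest entries per row (order within k is arbitrary).
    idx = np.argpartition(acts, -k, axis=1)[:, -k:]
    rows = np.arange(acts.shape[0])[:, None]
    codes = np.zeros_like(acts)
    codes[rows, idx] = acts[rows, idx]
    return codes


def forward(x, params, k):
    """Reconstruct x as (rank-r dense path) + (TopK sparse path)."""
    # Dense path: a purely linear map whose rank is at most r.
    dense = (x @ params["U"]) @ params["V"]
    # Sparse path: assumption here is that it encodes the same input x.
    pre_acts = (x - params["b_dec"]) @ params["W_enc"] + params["b_enc"]
    codes = topk_sparse_codes(pre_acts, k)
    sparse = codes @ params["W_dec"] + params["b_dec"]
    return dense + sparse, codes


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.normal(size=(8, 16))  # 8 toy activation vectors of width 16
    params = init_params(d_model=16, n_latents=64, rank=4)
    recon, codes = forward(x, params, k=5)
    print(recon.shape, codes.shape)
```

Trained jointly on a reconstruction loss, the dense path is free to take up whatever low-rank structure is cheapest to model linearly, so the sparse dictionary no longer needs always-on latents for it. Whether the sparse encoder should instead see the input minus the dense output is a design choice the abstract leaves open.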
