Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders

Sparse autoencoders (SAEs) are widely used to interpret neural network representations, but their utility depends on whether the learned features are reproducible across training runs. We study this question through \emph{feature stability}: for each SAE feature, we estimate the probability that a similar feature reappears in an independently trained SAE. This yields a scalable per-feature signal that separates stable from unstable features. In a large-scale study across seeds, models, layers, dictionary sizes, and SAE variants, we find a pronounced functional asymmetry: stable features carry most of the reconstruction- and prediction-relevant signal, while unstable features have weak marginal impact and are dominated by low-frequency surface-form triggers in both activation statistics and automatic explanations. Geometrically, unstable features are individually non-reproducible but concentrate in reproducible lower-rank subspaces, suggesting that seed dependence often reflects basis ambiguity within a shared region of activation space rather than pure noise. A controlled synthetic model makes this mechanism explicit, showing that low-rank ground-truth features can be recovered at the subspace level while remaining non-identifiable as individual SAE latents across seeds. Finally, by pooling unique cross-seed features, we construct more stable SAEs while preserving explained variance in this setting. Together, these results show that unstable features are not merely failed or noisy latents: they have weak individual functional impact, but reflect reproducible low-dimensional structure that standard SAEs resolve differently across seeds.
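
To make the per-feature stability signal and the subspace-level claim concrete, the following is a minimal sketch rather than the estimator used in the study. It assumes that "a similar feature" means a decoder direction in another seed whose cosine similarity exceeds a fixed threshold, and that subspace reproducibility can be summarized by the mean squared cosine of the principal angles. The names `feature_stability`, `subspace_overlap`, and `toy_dictionary`, the threshold of 0.9, and the toy dictionaries are all illustrative assumptions.

```python
import numpy as np


def unit_rows(W):
    """Normalize each row (one decoder direction per row) to unit length."""
    return W / np.linalg.norm(W, axis=1, keepdims=True)


def feature_stability(ref, others, threshold=0.9):
    """For each feature in `ref`, return the fraction of other-seed
    dictionaries that contain a match.

    Assumption: a match is a max cosine similarity >= threshold.
    """
    ref = unit_rows(ref)
    hits = np.zeros(ref.shape[0])
    for W in others:
        sims = ref @ unit_rows(W).T  # (n_ref, n_other) cosine similarities
        hits += sims.max(axis=1) >= threshold
    return hits / len(others)


def subspace_overlap(A, B):
    """Mean squared cosine of the principal angles between the row spaces
    of A and B (1.0 means the two subspaces coincide)."""
    Qa, _ = np.linalg.qr(A.T)  # orthonormal basis of span(rows of A)
    Qb, _ = np.linalg.qr(B.T)
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return float(np.mean(cosines ** 2))


def toy_dictionary(angle, d=8):
    """Hypothetical 3-feature dictionary: one fixed direction (e0) plus two
    directions spanning the e1-e2 plane, rotated by `angle` per 'seed'."""
    W = np.zeros((3, d))
    W[0, 0] = 1.0
    W[1, 1], W[1, 2] = np.cos(angle), np.sin(angle)
    W[2, 1], W[2, 2] = -np.sin(angle), np.cos(angle)
    return W


ref = toy_dictionary(0.0)
others = [toy_dictionary(np.pi / 4), toy_dictionary(np.pi / 3)]

# Feature 0 reappears in every seed; features 1 and 2 never match individually.
stability = feature_stability(ref, others)

# Yet the unstable features of two seeds span the same 2-D subspace.
overlap = subspace_overlap(ref[1:], others[0][1:])
```

In this toy setting the two rotated features have no individual match in any other seed, while the subspace they span is shared by all seeds, which is the basis-ambiguity pattern the abstract describes. With real SAEs the threshold and the choice of matching on decoder directions (rather than activations) would need to be validated.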
