Sparse autoencoders (SAEs) are widely used to extract human-interpretable features from neural network activations, but their learned features can vary substantially across random seeds and training choices. To improve stability, we study weight regularization, adding L1 or L2 penalties on encoder and decoder weights, and evaluate how regularization interacts with common SAE training defaults. On MNIST, we observe that L2 weight regularization produces a core of highly aligned features and, when combined with tied initialization and unit-norm decoder constraints, dramatically increases cross-seed feature consistency. For TopK SAEs trained on language model activations (Pythia-70M-deduped), adding a small L2 weight penalty increases the fraction of features shared across three random seeds and roughly doubles steering success rates, while leaving mean automated interpretability scores essentially unchanged. Finally, in the regularized setting, activation steering success is better predicted by auto-interpretability scores, suggesting that regularization can align text-based feature explanations with functional controllability.
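To make the training setup concrete, the sketch below shows a TopK SAE forward pass with tied initialization, unit-norm decoder rows, and an L2 weight penalty added to the reconstruction loss. This is a minimal NumPy illustration; the dimensions, penalty coefficient, and single-sample setup are placeholders, not the actual training configuration used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_sae, k = 16, 64, 4  # toy sizes (illustrative, not the paper's)
l2_coef = 1e-4                 # small L2 weight penalty (illustrative)

# Unit-norm decoder rows; tied initialization sets the encoder to the
# decoder's transpose at the start of training.
W_dec = rng.normal(size=(d_sae, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)
W_enc = W_dec.T.copy()
b_enc = np.zeros(d_sae)

def topk_sae(x):
    """TopK SAE forward pass: keep only the k largest pre-activations."""
    pre = x @ W_enc + b_enc
    threshold = np.sort(pre)[-k]  # k-th largest pre-activation
    acts = np.where(pre >= threshold, np.maximum(pre, 0.0), 0.0)
    recon = acts @ W_dec
    return recon, acts

x = rng.normal(size=d_model)
recon, acts = topk_sae(x)

# Objective: reconstruction error plus the L2 penalty on both weight matrices.
recon_loss = np.mean((x - recon) ** 2)
weight_penalty = l2_coef * (np.sum(W_enc ** 2) + np.sum(W_dec ** 2))
loss = recon_loss + weight_penalty

print("active features:", np.count_nonzero(acts), "loss:", loss)
```

In a real training loop the weight penalty term is simply added to the reconstruction loss before backpropagation, and the decoder rows are re-normalized (or gradient-projected) to stay unit-norm after each update.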