Sparse autoencoders (SAEs) are widely used to extract human-interpretable features from neural network activations, but their learned features can vary substantially across random seeds and training choices. To improve stability, we study weight regularization, adding L1 or L2 penalties on encoder and decoder weights, and evaluate how regularization interacts with common SAE training defaults. On MNIST, we observe that L2 weight regularization produces a core of highly aligned features and, when combined with tied initialization and unit-norm decoder constraints, dramatically increases cross-seed feature consistency. For TopK SAEs trained on language model activations (Pythia-70M-deduped), adding a small L2 weight penalty increases the fraction of features shared across three random seeds and roughly doubles steering success rates, while leaving mean automated interpretability scores essentially unchanged. Finally, in the regularized setting, auto-interpretability scores become better predictors of activation steering success, suggesting that regularization can align text-based feature explanations with functional controllability.
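The regularized objective described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`tied_unit_norm_init`, `topk_sae_loss`), the exact TopK encoder form (a ReLU applied to the k largest pre-activations), the pre-encoder bias subtraction, and the penalty coefficient `l2_coeff` are all assumptions chosen for clarity.

```python
import numpy as np

def tied_unit_norm_init(d_model, n_features, rng):
    """Hypothetical helper: unit-norm decoder rows, encoder tied to the decoder transpose."""
    W_dec = rng.standard_normal((n_features, d_model))
    W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # each feature direction has norm 1
    W_enc = W_dec.T.copy()                                  # tied initialization (assumed form)
    return W_enc, W_dec

def topk_sae_loss(x, W_enc, b_enc, W_dec, b_dec, k, l2_coeff):
    """Reconstruction error plus an L2 penalty on encoder and decoder weights.

    x: (batch, d_model); W_enc: (d_model, n_features); W_dec: (n_features, d_model).
    """
    pre = (x - b_dec) @ W_enc + b_enc                # pre-activations, (batch, n_features)
    # Keep the k largest pre-activations per example and zero out the rest.
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    mask = np.zeros(pre.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=1)
    z = np.where(mask, np.maximum(pre, 0.0), 0.0)    # sparse feature activations
    x_hat = z @ W_dec + b_dec                        # reconstruction
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    # L2 weight penalty on both weight matrices (assumed to share one coefficient).
    penalty = l2_coeff * (np.sum(W_enc ** 2) + np.sum(W_dec ** 2))
    return recon + penalty, recon, penalty, z
```

An L1 variant would replace the squared terms in `penalty` with `np.sum(np.abs(W_enc)) + np.sum(np.abs(W_dec))`. In a real training loop the same expression would be written in an autodiff framework so that gradients flow through both the reconstruction and penalty terms.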