Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language model (LM) activations, by finding sparse, linear reconstructions of those activations. We introduce the Gated Sparse Autoencoder (Gated SAE), which achieves a Pareto improvement over training with prevailing methods. In SAEs, the L1 penalty used to encourage sparsity introduces many undesirable biases, such as shrinkage, the systematic underestimation of feature activations. The key insight of Gated SAEs is to separate the functionality of (a) determining which directions to use and (b) estimating the magnitudes of those directions: this enables us to apply the L1 penalty only to the former, limiting the scope of undesirable side effects. By training SAEs on LMs of up to 7B parameters, we find that, in typical hyperparameter ranges, Gated SAEs solve shrinkage, are similarly interpretable, and require half as many firing features to achieve comparable reconstruction fidelity.
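To make the separation of (a) and (b) concrete, the following is a minimal Python sketch rather than the authors' implementation. The abstract does not specify the architecture, so everything here is an illustrative assumption: the two linear paths, the helper names (`gated_encode`, `sparsity_penalty`, `decode`), the weight shapes, and the toy numbers. The sketch shows only the core idea: a gate path decides which features fire, a magnitude path estimates how strongly, and the L1 penalty touches the gate path alone.

```python
# Minimal sketch of a gated encoder. All names, shapes, and values are
# hypothetical and chosen for illustration; they are not taken from the paper.

def matvec(W, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def relu(v):
    """Elementwise rectifier."""
    return [max(0.0, x) for x in v]


def gated_encode(x, W_gate, b_gate, W_mag, b_mag):
    """Return (features, gate_pre) for a single activation vector x.

    The gate path decides WHICH features fire; the magnitude path
    estimates HOW STRONGLY the firing features are active.
    """
    gate_pre = [p + b for p, b in zip(matvec(W_gate, x), b_gate)]
    mag = relu([p + b for p, b in zip(matvec(W_mag, x), b_mag)])
    # A feature is active only where its gate pre-activation is positive.
    features = [m if g > 0 else 0.0 for g, m in zip(gate_pre, mag)]
    return features, gate_pre


def sparsity_penalty(gate_pre):
    """L1 penalty on the rectified gate path only; magnitudes are not penalized."""
    return sum(relu(gate_pre))


def decode(features, W_dec, b_dec):
    """Linear reconstruction: x_hat = W_dec @ features + b_dec."""
    return [r + b for r, b in zip(matvec(W_dec, features), b_dec)]


# Toy example: a 2-dimensional activation and 3 dictionary features.
x = [1.0, -2.0]
W_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_gate = [-0.5, 0.0, 0.0]
W_mag = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
b_mag = [0.0, 0.0, 0.0]
W_dec = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]
b_dec = [0.0, 0.0]

features, gate_pre = gated_encode(x, W_gate, b_gate, W_mag, b_mag)
x_hat = decode(features, W_dec, b_dec)
penalty = sparsity_penalty(gate_pre)
```

In this toy setup only the first feature fires. Its magnitude (2.0) is set by the magnitude path and is unaffected by the penalty, which depends only on the gate pre-activations. That is the sense in which restricting the L1 term to the gate limits shrinkage of the estimated magnitudes.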