Sparse autoencoders (SAEs) are used to decompose neural network activations into human-interpretable features. Typically, the features learned by a single SAE are used for downstream applications. However, it has recently been shown that a single SAE captures only a limited subset of the features that can be extracted from the activation space. Motivated by this limitation, we introduce and formalize SAE ensembles, and we propose two ways to ensemble multiple SAEs: naive bagging and boosting. Naive bagging ensembles SAEs trained with different weight initializations, whereas boosting ensembles SAEs that are trained sequentially to minimize the residual error. We justify both approaches theoretically as ways to reduce reconstruction error. Empirically, we evaluate our ensemble approaches in three settings of language models and SAE architectures. Compared to an expanded SAE that matches the number of features in the ensemble, ensembling SAEs improves both the reconstruction of language model activations and SAE stability. In addition, SAE ensembles achieve better performance on downstream tasks such as concept detection and spurious correlation removal, showing improved practical utility.
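The two ensembling schemes described in the abstract can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the tiny ReLU SAE, the L1 sparsity penalty, plain gradient descent, the random data standing in for language model activations, and all function names (`train_sae`, `naive_bagging`, `boosting`, and so on) are hypothetical. The sketch also assumes that a bagged ensemble averages its members' reconstructions and that a boosted ensemble sums them; the abstract does not state the aggregation rule.

```python
import numpy as np


def train_sae(x, n_features, seed, l1=1e-3, lr=1e-2, steps=200):
    """Train a toy ReLU SAE on the rows of x with full-batch gradient descent."""
    rng = np.random.default_rng(seed)  # the seed controls the weight initialization
    n, d = x.shape
    w_enc = rng.normal(scale=0.1, size=(d, n_features))
    b_enc = np.zeros(n_features)
    w_dec = rng.normal(scale=0.1, size=(n_features, d))
    b_dec = np.zeros(d)
    for _ in range(steps):
        pre = x @ w_enc + b_enc
        f = np.maximum(pre, 0.0)              # sparse feature activations
        err = f @ w_dec + b_dec - x           # reconstruction error
        # Loss: mean squared reconstruction error plus an L1 penalty on f.
        g_xhat = 2.0 * err / n
        g_f = g_xhat @ w_dec.T + l1 / n       # f >= 0, so d|f|/df = 1 where active
        g_pre = g_f * (pre > 0)               # ReLU gate
        w_dec -= lr * (f.T @ g_xhat)
        b_dec -= lr * g_xhat.sum(axis=0)
        w_enc -= lr * (x.T @ g_pre)
        b_enc -= lr * g_pre.sum(axis=0)
    return w_enc, b_enc, w_dec, b_dec


def reconstruct(sae, x):
    """Encode x with the SAE and decode it back to activation space."""
    w_enc, b_enc, w_dec, b_dec = sae
    return np.maximum(x @ w_enc + b_enc, 0.0) @ w_dec + b_dec


def naive_bagging(x, n_members, n_features):
    """Train members on the same data, differing only in initialization seed."""
    return [train_sae(x, n_features, seed=k) for k in range(n_members)]


def bagging_reconstruct(members, x):
    """Assumed aggregation: average the members' reconstructions."""
    return np.mean([reconstruct(m, x) for m in members], axis=0)


def boosting(x, n_members, n_features):
    """Train members sequentially, each on the residual left by the earlier ones."""
    members, residual = [], x
    for k in range(n_members):
        sae = train_sae(residual, n_features, seed=k)
        members.append(sae)
        residual = residual - reconstruct(sae, residual)
    return members


def boosting_reconstruct(members, x):
    """Assumed aggregation: sum the members' reconstructions of successive residuals."""
    total, residual = np.zeros_like(x), x
    for sae in members:
        part = reconstruct(sae, residual)
        total = total + part
        residual = residual - part
    return total


# Random vectors stand in for language model activations (an assumption).
x = np.random.default_rng(0).normal(size=(256, 8))
bag = naive_bagging(x, n_members=3, n_features=16)
boost = boosting(x, n_members=3, n_features=16)
bag_hat = bagging_reconstruct(bag, x)
boost_hat = boosting_reconstruct(boost, x)
print("bagging MSE: ", np.mean((x - bag_hat) ** 2))
print("boosting MSE:", np.mean((x - boost_hat) ** 2))
```

In this sketch each ensemble holds 3 × 16 = 48 features in total, so the comparison the abstract describes would be against a single expanded SAE with 48 features. The toy setup is far too small to reproduce the reported results; it only shows how the two training procedures differ.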