Identifying the features learned by neural networks is a core challenge in mechanistic interpretability. Sparse autoencoders (SAEs), which learn a sparse, overcomplete dictionary that reconstructs a network's internal activations, have been used to identify these features. However, SAEs may learn more about the structure of the dataset than the computational structure of the network. There is therefore only indirect reason to believe that the directions found in these dictionaries are functionally important to the network. We propose end-to-end (e2e) sparse dictionary learning, a method for training SAEs that ensures the features learned are functionally important by minimizing the KL divergence between the output distributions of the original model and the model with SAE activations inserted. Compared to standard SAEs, e2e SAEs offer a Pareto improvement: they explain more network performance, require fewer total features, and require fewer simultaneously active features per datapoint, all with no cost to interpretability. We explore geometric and qualitative differences between e2e SAE features and standard SAE features. E2e dictionary learning brings us closer to methods that can explain network behavior concisely and accurately. We release our library for training e2e SAEs and reproducing our analysis at https://github.com/ApolloResearch/e2e_sae
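The core objective above (reconstruct activations, but supervise through the model's output distribution rather than the activations themselves) can be illustrated with a minimal sketch. This is not the released library's API; the tiny linear "model", the dictionary sizes, and the sparsity coefficient are all hypothetical stand-ins chosen for brevity, assuming a standard ReLU SAE with an L1 sparsity penalty:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, d_dict, n_vocab = 8, 32, 5  # hypothetical sizes

# Toy stand-in for the downstream model: activations -> logits.
# (In the paper's setting this would be the rest of the transformer.)
unembed = torch.randn(d_model, n_vocab)

# SAE: an overcomplete dictionary with a ReLU encoder and linear decoder.
W_enc = torch.randn(d_model, d_dict, requires_grad=True)
W_dec = torch.randn(d_dict, d_model, requires_grad=True)

def sae(acts):
    codes = F.relu(acts @ W_enc)   # sparse feature activations
    return codes @ W_dec, codes    # reconstructed activations, codes

acts = torch.randn(4, d_model)     # a batch of internal activations

recon, codes = sae(acts)
logits_orig = acts @ unembed       # original model's output
logits_sae = recon @ unembed       # output with SAE reconstruction inserted

# e2e objective: KL(original output dist || output dist with SAE inserted),
# plus an L1 penalty encouraging few simultaneously active features.
kl = F.kl_div(
    F.log_softmax(logits_sae, dim=-1),
    F.log_softmax(logits_orig, dim=-1),
    log_target=True,
    reduction="batchmean",
)
loss = kl + 1e-3 * codes.abs().sum(dim=-1).mean()
loss.backward()  # gradients flow through the downstream model into the SAE
```

The key difference from a standard SAE is what the gradient sees: here the reconstruction error is never penalized directly, so the dictionary is only pushed to preserve whatever in the activations actually affects the model's outputs.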