On the Theoretical Foundation of Sparse Dictionary Learning in Mechanistic Interpretability

As AI models achieve remarkable capabilities across diverse domains, understanding what representations they learn and how they process information has become increasingly important for both scientific progress and trustworthy deployment. Recent work in mechanistic interpretability has shown that neural networks represent meaningful concepts as directions in their representation spaces and often encode many concepts in superposition. Sparse dictionary learning (SDL) methods, including sparse autoencoders, transcoders, and crosscoders, address this by training auxiliary models with sparsity constraints to disentangle these superposed concepts into monosemantic features. These methods have demonstrated remarkable empirical success, but their theoretical understanding remains limited. Existing theoretical work covers only sparse autoencoders with tied-weight constraints, leaving the broader family of SDL methods without formal grounding. In this work, we develop the first unified theoretical framework that casts SDL as a single optimization problem. We show how diverse methods instantiate this framework and provide a rigorous analysis of the optimization landscape. We provide novel theoretical explanations for empirically observed phenomena, including feature absorption and dead neurons. We design the Linear Representation Bench, a benchmark that strictly follows the Linear Representation Hypothesis, to evaluate SDL methods with fully accessible ground-truth features. Motivated by our theory and findings, we develop feature anchoring, a novel technique applicable to all SDL methods, to enhance their feature recovery capabilities.
