A Unified Theory of Sparse Dictionary Learning in Mechanistic Interpretability: Piecewise Biconvexity and Spurious Minima

As AI models achieve remarkable capabilities across diverse domains, understanding what representations they learn and how they encode concepts has become increasingly important for both scientific progress and trustworthy deployment. Recent work in mechanistic interpretability has widely reported that neural networks represent meaningful concepts as linear directions in their representation spaces and often encode diverse concepts in superposition. Sparse dictionary learning (SDL) methods, including sparse autoencoders, transcoders, and crosscoders, address this by training auxiliary models with sparsity constraints to disentangle the superposed concepts into monosemantic features. These methods are the backbone of modern mechanistic interpretability, yet in practice they consistently produce polysemantic features, feature absorption, and dead neurons, and there is very limited theoretical understanding of why these phenomena occur. Existing theoretical work is limited to tied-weight sparse autoencoders, leaving the broader family of SDL methods without formal grounding. We develop the first unified theoretical framework that casts all major SDL variants as a single piecewise biconvex optimization problem, and we characterize its global solution set, non-identifiability, and spurious optima. This analysis yields principled explanations for feature absorption and dead neurons. To expose these pathologies under full ground-truth access, we introduce the Linear Representation Bench. Guided by our theory, we propose feature anchoring, a novel technique that restores SDL identifiability and substantially improves feature recovery on both synthetic benchmarks and real neural representations.
