Multilayer perceptrons (MLPs) are an integral part of large language models, yet their dense representations render them difficult to understand, edit, and steer. Recent methods learn interpretable approximations via neuron-level sparsity, but fail to faithfully reconstruct the original mapping--significantly increasing the model's next-token cross-entropy loss. In this paper, we advocate for moving to layer-level sparsity to overcome the accuracy trade-off in sparse layer approximation. Under this paradigm, we introduce Mixture of Decoders (MxDs). MxDs generalize MLPs and Gated Linear Units, expanding pre-trained dense layers into tens of thousands of specialized sublayers. Through a flexible form of tensor factorization, each sparsely activating MxD sublayer implements a linear transformation with full-rank weights--preserving the original decoders' expressive capacity even under heavy sparsity. Experimentally, we show that MxDs significantly outperform state-of-the-art methods (e.g., Transcoders) on the sparsity-accuracy frontier in language models with up to 3B parameters. Further evaluations on sparse probing and feature steering demonstrate that MxDs learn similarly specialized features of natural language--opening up a promising new avenue for designing interpretable yet faithful decompositions. Our code is included at: https://github.com/james-oldfield/MxD/.
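The layer-level sparsity idea above can be sketched as a gated mixture of linear decoders: a gating network scores many candidate sublayers, only the top-k fire per token, and each active sublayer applies its own full-rank linear map. The sketch below is illustrative only (toy sizes, a plain softmax-over-top-k gate, and per-expert weight matrices in place of the paper's tensor factorization); all names and shapes are assumptions, not the reference implementation.

```python
# Minimal sketch of a sparsely gated mixture of linear decoders.
# NOT the paper's exact MxD factorization: expert weights are stored
# as explicit per-sublayer matrices here for clarity.
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; real MxDs expand a dense layer into tens of thousands of sublayers.
d_in, d_out, n_experts, k = 8, 8, 16, 2

W_gate = rng.normal(size=(d_in, n_experts))        # gating weights (hypothetical)
W_dec = rng.normal(size=(n_experts, d_in, d_out))  # one full-rank linear decoder per sublayer

def mxd_forward(x):
    """Forward pass for one token vector: top-k gate, then mix active decoders."""
    scores = x @ W_gate                      # (n_experts,) gate logits
    top = np.argsort(scores)[-k:]            # indices of the k highest-scoring sublayers
    gates = np.exp(scores[top] - scores[top].max())
    gates /= gates.sum()                     # softmax over only the active sublayers
    # Each active sublayer applies a dense linear map; outputs are gate-weighted.
    return sum(g * (x @ W_dec[i]) for g, i in zip(gates, top))

y = mxd_forward(rng.normal(size=d_in))
print(y.shape)
```

Because every expert is a full linear map rather than a single neuron direction, heavy gate sparsity (small k) need not reduce the rank of the transformation each token receives, which is the intuition behind the accuracy claims above.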