Mixture-of-Experts (MoE) architectures decouple model capacity from per-token computation, enabling scaling beyond the computational limits imposed by dense scaling laws. Yet how MoE architectures shape knowledge acquisition during pre-training, and how this process differs from that in dense architectures, remains poorly understood. To address this gap, we introduce Gated-LPI (Log-Probability Increase), a neuron-level attribution metric that decomposes the increase in target log-probability across individual neurons. We present a time-resolved comparison of knowledge-acquisition dynamics in MoE and dense architectures, tracking checkpoints over 1.2M training steps (~5.0T tokens) and 600K training steps (~2.5T tokens), respectively. Our experiments uncover three patterns. (1) Low-entropy backbone: the top ~1% of MoE neurons capture over 45% of positive updates, forming a high-utility core that is absent in the dense baseline. (2) Early consolidation: the MoE model locks into a stable importance profile within <100K steps, whereas the dense model remains volatile throughout training. (3) Functional robustness: masking the ten most important MoE attention heads reduces relational HIT@10 by <10%, compared with >50% for the dense model, showing that sparsity fosters distributed rather than brittle knowledge storage. Together, these patterns demonstrate that sparsity yields an intrinsically stable and distributed computational backbone from early in training, helping bridge the gap between sparse architectures and training-time interpretability.
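To make the idea of a neuron-level decomposition of log-probability increase concrete, the following is a minimal toy sketch. It assumes a single gated layer whose target logit is linear in the gated neuron activations, so each neuron's share of the logit change between two checkpoints is an exact additive term (the logit serving as a linear proxy for log-probability, up to the log-partition term). All variable names and the specific decomposition are illustrative assumptions; the paper's actual Gated-LPI definition may differ in detail.

```python
import numpy as np

def gated_neuron_contributions(h, g, W_out, target):
    """Per-neuron additive contribution to the target logit.

    In this toy gated layer the target logit decomposes exactly as
    logit_t = sum_i W_out[t, i] * g[i] * h[i], so each gated neuron's
    share is a single product term. (Illustrative only; the paper's
    Gated-LPI metric may be defined differently.)
    """
    return W_out[target] * g * h  # shape: (n_neurons,)

rng = np.random.default_rng(0)
n_neurons, vocab = 8, 5
h = rng.standard_normal(n_neurons)    # hypothetical expert activations
g = rng.random(n_neurons)             # hypothetical gate values in [0, 1]
# Two hypothetical checkpoints of the output projection:
W_old = rng.standard_normal((vocab, n_neurons))
W_new = W_old + 0.1 * rng.standard_normal((vocab, n_neurons))

target = 2
# Change in each neuron's contribution between checkpoints: summed over
# neurons, it recovers the full target-logit increase exactly.
delta = (gated_neuron_contributions(h, g, W_new, target)
         - gated_neuron_contributions(h, g, W_old, target))

logit_old = W_old[target] @ (g * h)
logit_new = W_new[target] @ (g * h)
assert np.isclose(delta.sum(), logit_new - logit_old)
```

Ranking neurons by such per-neuron increments is what would let one ask, as the abstract does, whether a small fraction of neurons captures most of the positive updates.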