Unveiling Super Experts in Mixture-of-Experts Large Language Models

In this study, we report, for the first time, the discovery and systematic investigation of a distinct subset of experts that play a pivotal role in the forward inference of Mixture-of-Experts (MoE) LLMs. These experts are prevalent in open-source MoE LLMs, and despite their extremely limited number, pruning them results in a substantial decline in model performance (e.g., pruning just three out of 6,144 experts causes Qwen3-30B-A3B to generate repetitive and uninformative outputs). We refer to these experts as Super Experts (SEs). Our comprehensive analysis provides progressively deeper insights into SEs: (i) SEs are characterized by rare but extreme activation outliers in the output of the down_proj, which give rise to massive activations in the hidden states between decoder layers. Moreover, the distribution of SEs is model-specific, data-agnostic, and unaffected by post-training processes. (ii) By pruning SEs, we assess their significance across a variety of tasks, revealing their considerable impact on the model's overall performance, particularly in mathematical reasoning. (iii) We further investigate why compressing SEs has such a pronounced impact. We show that, in MoE LLMs, SEs serve as the primary source of the systematic outlier mechanism in Transformers, and that compressing them profoundly disrupts this mechanism, ultimately causing the collapse of attention sinks. These findings advance the understanding of the internal dynamics of MoE LLMs, filling an important gap in current knowledge. The code is available at https://github.com/ZunhaiSu/Super-Experts-Profilling.
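To make point (i) concrete, the following is a minimal sketch of how experts with extreme down_proj output magnitudes might be flagged once per-expert peak activations have been recorded. It is not the authors' profiling procedure: the function name `find_outlier_experts`, the median-ratio criterion, the `ratio` default, and the toy numbers are all illustrative assumptions.

```python
import statistics
from typing import Dict, List, Tuple

# (decoder layer index, expert index within that layer)
ExpertId = Tuple[int, int]


def find_outlier_experts(
    peak_abs: Dict[ExpertId, float], ratio: float = 50.0
) -> List[ExpertId]:
    """Flag experts whose peak |down_proj output| dwarfs the typical expert.

    peak_abs maps each expert to the largest absolute value seen in its
    down_proj output over some calibration data. The criterion (peak greater
    than `ratio` times the median peak) is a hypothetical stand-in chosen
    for illustration, not the criterion used in the paper.
    """
    if not peak_abs:
        return []
    threshold = ratio * statistics.median(peak_abs.values())
    return sorted(e for e, v in peak_abs.items() if v > threshold)


# Toy data (made up): 2 layers x 4 experts, with one extreme outlier.
toy_peaks: Dict[ExpertId, float] = {
    (0, 0): 1.2, (0, 1): 0.8, (0, 2): 2.1, (0, 3): 1.5,
    (1, 0): 1.1, (1, 1): 3.0, (1, 2): 900.0, (1, 3): 0.9,
}

print(find_outlier_experts(toy_peaks))  # [(1, 2)]
```

In a real model, the peak values would have to be collected by instrumenting each expert's down_proj during forward passes. The abstract's claim that the SE distribution is data-agnostic suggests the flagged set should be stable across calibration datasets, but this sketch does not test that.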
