Most interpretability work focuses on layer- or neuron-level mechanisms in Transformers, leaving expert-level behavior in Mixture-of-Experts (MoE) LLMs underexplored. Motivated by functional specialization in the human brain, we analyze expert activation in MoE models across three public domains, distinguishing domain experts from driver experts, and address two key questions: (1) which experts are activated, and whether certain expert types exhibit consistent activation patterns; and (2) how tokens are associated with, and trigger, the activation of specific experts. To answer these questions, we introduce an entropy-based metric that assesses whether an expert is strongly favored by a particular domain and a causal-effect metric that quantifies how strongly an expert's activation contributes to the model's output, thereby identifying domain experts and driver experts, respectively. We further examine how individual tokens are associated with the activation of specific experts. Our analysis reveals that (1) among the activated experts, some show clear domain preferences while others exert strong causal influence on model performance, underscoring their decisive roles; (2) tokens occurring earlier in a sentence are more likely to trigger driver experts; and (3) adjusting the weights of domain and driver experts leads to significant performance gains across all three models and domains. These findings shed light on the internal mechanisms of MoE models and enhance their interpretability.
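To make the entropy-based criterion concrete, the following is a minimal sketch of how such a domain-preference score could be computed from router statistics. It assumes hypothetical activation counts per expert and domain; the array shapes, the function name `domain_preference_entropy`, and the thresholding by lowest entropy are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

# Hypothetical routing statistics: activation_counts[e, d] counts how often
# expert e is selected by the router on tokens from domain d.
# (8 experts x 3 domains; values are synthetic for illustration.)
rng = np.random.default_rng(0)
activation_counts = rng.integers(1, 100, size=(8, 3)).astype(float)

def domain_preference_entropy(counts: np.ndarray) -> np.ndarray:
    """Per-expert entropy of the activation distribution over domains.

    Low entropy means activations concentrate on one domain, marking a
    candidate domain expert; high entropy means the expert is domain-agnostic.
    """
    probs = counts / counts.sum(axis=1, keepdims=True)  # normalize per expert
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

entropy = domain_preference_entropy(activation_counts)
# Experts with the lowest entropy are the strongest domain-expert candidates.
candidates = np.argsort(entropy)[:2]
print("domain-expert candidates:", candidates, "entropies:", entropy[candidates])
```

The causal-effect metric would be measured analogously but interventionally, e.g., by ablating or down-weighting a single expert's contribution and recording the change in the model's output, rather than from routing counts alone.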