Mixture-of-Experts (MoE) architectures have become the dominant choice for scaling Large Language Models (LLMs), activating only a subset of parameters per token. While MoE architectures are primarily adopted for computational efficiency, it remains an open question whether their sparsity makes them inherently easier to interpret than dense feed-forward networks (FFNs). We compare MoE experts and dense FFNs using $k$-sparse probing and find that expert neurons are consistently less polysemantic, with the gap widening as routing becomes sparser. This suggests that sparsity pressures both individual neurons and entire experts toward monosemanticity. Leveraging this finding, we zoom out from the neuron to the expert level as a more effective unit of analysis. We validate this approach by automatically interpreting hundreds of experts. This analysis allows us to resolve the debate on specialization: experts are neither broad domain specialists (e.g., biology) nor simple token-level processors. Instead, they function as fine-grained task experts, specializing in linguistic operations or semantic tasks (e.g., closing brackets in LaTeX). Our findings suggest that MoEs are inherently interpretable at the expert level, providing a clearer path toward large-scale model interpretability. Code is available at: https://github.com/jerryy33/MoE_analysis
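For readers unfamiliar with $k$-sparse probing, the idea can be sketched as follows: rank neurons by how strongly they relate to a feature, fit a linear probe on only the top $k$, and read high accuracy at small $k$ as evidence that few neurons carry the feature. The sketch below is an illustration under stated assumptions, not the implementation in the linked repository. The ranking heuristic (absolute difference of class-mean activations), the synthetic data, and all function names (`make_toy_data`, `select_top_k`, `fit_probe`, `accuracy`) are hypothetical.

```python
import math
import random
from statistics import mean


def make_toy_data(n=400, n_neurons=8, signal_neuron=3, seed=0):
    """Synthetic activations (assumption): one neuron tracks the label, the rest are noise."""
    rng = random.Random(seed)
    acts, labels = [], []
    for i in range(n):
        y = i % 2  # alternate labels so both classes are balanced
        row = [rng.gauss(0.0, 1.0) for _ in range(n_neurons)]
        row[signal_neuron] = 2.0 * y + rng.gauss(0.0, 0.3)
        acts.append(row)
        labels.append(y)
    return acts, labels


def select_top_k(acts, labels, k):
    """Rank neurons by |mean(act | y=1) - mean(act | y=0)| and keep the k best indices."""
    pos = [a for a, y in zip(acts, labels) if y == 1]
    neg = [a for a, y in zip(acts, labels) if y == 0]
    scores = [
        abs(mean(r[j] for r in pos) - mean(r[j] for r in neg))
        for j in range(len(acts[0]))
    ]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_probe(acts, labels, idx, lr=0.3, epochs=300):
    """Logistic-regression probe restricted to neurons `idx`, trained by batch gradient descent."""
    w, b, n = [0.0] * len(idx), 0.0, len(acts)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(idx), 0.0
        for a, y in zip(acts, labels):
            err = sigmoid(b + sum(wi * a[j] for wi, j in zip(w, idx))) - y
            for t, j in enumerate(idx):
                grad_w[t] += err * a[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(acts, labels, idx, w, b):
    """Fraction of examples whose probe logit falls on the correct side of zero."""
    hits = 0
    for a, y in zip(acts, labels):
        z = b + sum(wi * a[j] for wi, j in zip(w, idx))
        hits += int((z > 0) == (y == 1))
    return hits / len(acts)


if __name__ == "__main__":
    acts, labels = make_toy_data()
    train_x, train_y = acts[:200], labels[:200]
    test_x, test_y = acts[200:], labels[200:]
    for k in (1, 2, 4):
        idx = select_top_k(train_x, train_y, k)
        w, b = fit_probe(train_x, train_y, idx)
        print(k, idx, accuracy(test_x, test_y, idx, w, b))
```

In this toy setup a single neuron is built to encode the label, so a probe with $k=1$ should already separate the classes well. A feature spread across many neurons would instead need a larger $k$ to reach the same accuracy, which is the sense in which probe accuracy at small $k$ serves as a proxy for monosemanticity.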