Sparse Mixture-of-Experts (MoE) models scale Transformers efficiently but suffer from expert overlap: redundant representations across experts and ambiguous routing, which leave model capacity severely underutilized. While architectural solutions like DeepSeekMoE promote specialization, they require substantial structural modifications and rely solely on intra-layer signals. In this paper, we propose two plug-and-play regularization losses that enhance MoE specialization and routing efficiency without modifying the router or model architecture. First, an intra-layer specialization loss penalizes the cosine similarity between experts' SwiGLU activations on the same token, encouraging experts to specialize in complementary knowledge. Second, a cross-layer coupling loss maximizes joint Top-$k$ routing probabilities across adjacent layers, establishing coherent expert pathways through network depth while reinforcing intra-layer expert specialization. Both losses are orthogonal to the standard load-balancing loss and compatible with both the shared-expert architecture of DeepSeekMoE and vanilla Top-$k$ MoE architectures. We implement both losses as a drop-in Megatron-LM module. Extensive experiments across pre-training, fine-tuning, and zero-shot benchmarks demonstrate consistent task gains, higher expert specialization, and lower-entropy routing; together, these improvements translate into faster inference via more stable expert pathways.
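The two auxiliary losses described above can be sketched in plain Python on toy inputs. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, and the exact form of the joint Top-$k$ term (here, the sum of per-expert probability products over experts routed in both layers' Top-$k$) is an assumption about how the coupling objective could be instantiated.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def specialization_loss(expert_acts):
    """Intra-layer specialization loss (illustrative): mean pairwise cosine
    similarity between experts' activations on the same token. Minimizing it
    pushes experts toward dissimilar (complementary) representations."""
    n = len(expert_acts)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(expert_acts[i], expert_acts[j]) for i, j in pairs) / len(pairs)

def coupling_loss(p_layer_a, p_layer_b, k=2):
    """Cross-layer coupling loss (illustrative): negative joint probability
    mass on experts that appear in the Top-k of both adjacent layers, so
    minimizing it maximizes the joint Top-k routing probability."""
    topk_a = sorted(range(len(p_layer_a)), key=lambda e: -p_layer_a[e])[:k]
    topk_b = sorted(range(len(p_layer_b)), key=lambda e: -p_layer_b[e])[:k]
    shared = set(topk_a) & set(topk_b)
    return -sum(p_layer_a[e] * p_layer_b[e] for e in shared)
```

In training, both terms would be added to the language-modeling objective alongside the standard load-balancing loss, each with its own weight; the sketch uses per-token expert activation vectors and per-layer routing distributions as inputs.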