Mixture-of-Experts (MoE) models enable efficient scaling of large language models (LLMs) by activating only a subset of experts per input. However, we observe that the commonly used auxiliary load-balancing loss often leads to expert overlap and overly uniform routing, which hinders expert specialization and degrades overall performance during post-training. To address this, we propose a simple yet effective solution that introduces two complementary objectives: (1) an orthogonality loss to encourage experts to process distinct types of tokens, and (2) a variance loss to encourage more discriminative routing decisions. Gradient-level analysis demonstrates that these objectives are compatible with the existing auxiliary loss and jointly contribute to better optimization during training. Experimental results across various model architectures and multiple benchmarks show that our method significantly enhances expert specialization. Notably, it improves classic MoE baselines trained with the auxiliary loss by up to 23.79%, while also maintaining load balancing in downstream tasks, all without architectural modifications or additional components. We will release our code to the community.
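To make the two objectives concrete, the following is a minimal sketch of one plausible instantiation, not necessarily the exact formulation used in the paper; the symbols $E$ (number of experts), $T$ (tokens per batch), $\bar{H}$ (stacked per-expert mean token representations), $p_t$ (router distribution for token $t$), and the weights $\lambda_1, \lambda_2$ are illustrative assumptions:

% Hedged sketch: one plausible form of the two auxiliary objectives.
% L_orth pushes the normalized mean representations of tokens routed to
% different experts toward mutual orthogonality; L_var rewards peaked
% (high-variance) router distributions instead of near-uniform ones.
\begin{align}
  \mathcal{L}_{\text{orth}} &= \bigl\lVert \bar{H}\bar{H}^{\top} - I_{E} \bigr\rVert_F^{2}, \\
  \mathcal{L}_{\text{var}}  &= -\frac{1}{T}\sum_{t=1}^{T} \operatorname{Var}\!\bigl(p_t\bigr), \\
  \mathcal{L}_{\text{total}} &= \mathcal{L}_{\text{LM}} + \mathcal{L}_{\text{aux}}
      + \lambda_{1}\,\mathcal{L}_{\text{orth}} + \lambda_{2}\,\mathcal{L}_{\text{var}},
\end{align}
where $\bar{H} \in \mathbb{R}^{E \times d}$ stacks the $\ell_2$-normalized mean hidden states of the tokens routed to each of the $E$ experts, $p_t$ is the router's probability distribution over experts for token $t$, and $\mathcal{L}_{\text{aux}}$ denotes the standard load-balancing auxiliary loss.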