In this study, we systematically evaluate how common design choices in Mixture-of-Experts (MoE) models affect validation performance, and find that these choices have distinct effects at the token and sequence levels. We also present empirical evidence that a learned router performs comparably to a frozen, randomly initialized router, suggesting that learned routing may not be essential. Finally, we show that sequence-level routing can produce weak, topic-specific expert specialization, in contrast to the syntax specialization observed with token-level routing.
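To make the distinction between the two routing granularities concrete, the following is a minimal NumPy sketch, not the configuration used in the study. The function names, the Gaussian initialization, the top-k selection, and the mean pooling used for sequence-level routing are all illustrative assumptions. The "frozen" router is simply a weight matrix that is sampled once and never updated.

```python
import numpy as np


def make_frozen_router(d_model, n_experts, seed=0):
    """Sample router weights once; they are never trained afterwards.

    Assumption: scaled Gaussian init. The study's actual init may differ.
    """
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, d_model ** -0.5, size=(d_model, n_experts))


def token_level_route(x, w_router, k=1):
    """Pick the top-k experts independently for every token.

    x: array of shape (batch, seq_len, d_model)
    returns: expert indices of shape (batch, seq_len, k)
    """
    logits = x @ w_router  # (batch, seq_len, n_experts)
    # Sort scores in descending order and keep the k best experts per token.
    return np.argsort(-logits, axis=-1)[..., :k]


def sequence_level_route(x, w_router, k=1):
    """Pick the top-k experts once per sequence and share them across tokens.

    Assumption: the sequence is summarized by mean pooling over tokens.
    """
    pooled = x.mean(axis=1)                    # (batch, d_model)
    logits = pooled @ w_router                 # (batch, n_experts)
    top = np.argsort(-logits, axis=-1)[:, :k]  # (batch, k)
    # Every token in a sequence receives the same expert assignment.
    return np.broadcast_to(top[:, None, :], (x.shape[0], x.shape[1], k))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.normal(size=(2, 5, 8))             # 2 sequences, 5 tokens, d_model=8
    w = make_frozen_router(d_model=8, n_experts=4)
    print(token_level_route(x, w, k=2).shape)     # (2, 5, 2)
    print(sequence_level_route(x, w, k=2).shape)  # (2, 5, 2)
```

In this sketch, token-level routing can send neighboring tokens to different experts, whereas sequence-level routing assigns one expert set to the whole sequence. This is in line with the granularity at which each form of specialization (syntax versus topic) is reported.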