Mixture-of-Experts (MoE) architectures have become key to scaling modern LLMs, yet little is understood about how their sparse routing dynamics respond to multilingual data. In this work, we analyze expert routing patterns on parallel multilingual datasets and uncover highly interpretable layer-wise phenomena. We find that MoE models route tokens in language-specific ways in the early and late decoder layers but exhibit significant cross-lingual routing alignment in the middle layers, mirroring parameter-sharing trends observed in dense LLMs. In particular, we reveal a clear, strong correlation between a model's performance in a given language and how similarly that language's tokens are routed relative to English in these layers. Moving beyond correlation, we explore inference-time interventions that induce higher cross-lingual routing alignment. We introduce a method that steers the router by promoting middle-layer task experts frequently activated in English, and show that it increases multilingual performance. These gains of 1-2% are remarkably consistent across two evaluation tasks, three models, and more than 15 languages, especially given that these simple interventions override the routers of extensively trained, state-of-the-art LLMs. By contrast, interventions outside the middle layers or those targeting multilingual-specialized experts yield only performance degradation. Altogether, we present numerous findings that explain how MoEs process non-English text and demonstrate that generalization is limited by the model's ability to leverage language-universal experts in all languages.
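To make the intervention concrete, the sketch below shows one simple way such router steering could work: add a fixed bonus to the router logits of a chosen set of experts before standard top-k selection, so those experts are more likely to be activated. This is a minimal illustration under assumptions; the function names, the additive-bias form, the value of `alpha`, and the promoted expert indices are all hypothetical, not the paper's exact method.

```python
import numpy as np

def steer_router(logits, promoted_experts, alpha=0.5):
    """Add a fixed bonus `alpha` to the router logits of the promoted
    experts before top-k selection (hypothetical sketch, not the
    paper's exact procedure)."""
    steered = logits.copy()
    steered[..., promoted_experts] += alpha
    return steered

def top_k_route(logits, k=2):
    """Standard top-k MoE routing: pick the k highest-scoring experts
    per token and softmax-normalize their gate weights."""
    idx = np.argsort(logits, axis=-1)[..., ::-1][..., :k]
    gate_logits = np.take_along_axis(logits, idx, axis=-1)
    gates = np.exp(gate_logits - gate_logits.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)
    return idx, gates

# Example: one token, 8 experts; experts 2 and 5 stand in for
# middle-layer "English task experts" (purely illustrative indices).
logits = np.array([0.1, 0.3, 0.25, 0.9, 0.0, 0.28, 0.2, 0.1])
idx, gates = top_k_route(steer_router(logits, [2, 5]), k=2)
# Steering changes the second selected expert from 1 to the promoted
# expert 5, while the top expert (3) is unaffected.
```

In this form the intervention only biases, rather than replaces, the trained router: an expert the router strongly prefers still wins, but promoted experts displace weakly preferred ones, which matches the paper's framing of nudging routing toward experts frequently activated in English.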