The Mixture of Experts (MoE) approach is well-suited to multilingual and code-switching (CS) tasks because of its multi-expert architecture. This work introduces DLG-MoE, a Dynamic Language Group-based MoE optimized for bilingual and CS scenarios. DLG-MoE operates via a hierarchical routing mechanism. First, a language router explicitly models the language and dispatches representations to the corresponding language expert groups. Then, an unsupervised router within each language group implicitly models attributes beyond language and coordinates expert routing and collaboration. The model achieves state-of-the-art (SOTA) performance while retaining exceptional flexibility: it supports streaming inference with different top-k values, and its parameters can be pruned to obtain a monolingual sub-model. The code will be released.
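The hierarchical routing described above can be sketched minimally as follows. This is an illustrative NumPy toy, not the paper's implementation: the dimensions, weight matrices (`W_lang`, `W_group`, `experts`), and the single-token formulation are all hypothetical assumptions; the real model presumably trains the language router with supervision and uses learned neural experts.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): model dim, 2 languages,
# 4 experts per language group, top-2 expert selection.
d_model, n_langs, experts_per_group, top_k = 16, 2, 4, 2

W_lang = rng.standard_normal((d_model, n_langs))                  # language router
W_group = rng.standard_normal((n_langs, d_model, experts_per_group))  # per-group routers
experts = rng.standard_normal((n_langs, experts_per_group, d_model, d_model))

def dlg_moe_layer(x):
    """Sketch of hierarchical routing for one token representation x."""
    # 1) Language router: explicit language posterior selects a language group.
    lang_probs = softmax(x @ W_lang)
    g = int(lang_probs.argmax())
    # 2) Unsupervised router inside the chosen group: top-k expert mixture.
    scores = softmax(x @ W_group[g])
    top = np.argsort(scores)[-top_k:]
    gate = scores[top] / scores[top].sum()  # renormalize over the top-k experts
    out = sum(w * (x @ experts[g, e]) for e, w in zip(top, gate))
    return out, g, top

x = rng.standard_normal(d_model)
y, group, chosen = dlg_moe_layer(x)
```

Because routing to a language group is a discrete decision, dropping one group's experts and router (as in the pruning step above) yields a working monolingual sub-model, which matches the flexibility the abstract claims.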