The Mixture of Experts (MoE) approach for language models has proven effective at increasing model capacity by dynamically routing each input token to a specific subset of experts for processing. Despite this success, most existing methods struggle to balance sparsity against the availability of expert knowledge: improving performance by drawing on more expert knowledge typically reduces sparsity in expert selection. To mitigate this tension, we propose HyperMoE, a novel MoE framework built upon Hypernetworks. The framework integrates the computational processes of MoE with the concept of knowledge transfer from multi-task learning. Specific modules, generated from the information of unselected experts, serve as supplementary information, allowing the knowledge of unselected experts to be used while maintaining selection sparsity. Our comprehensive empirical evaluations across multiple datasets and backbones show that HyperMoE significantly outperforms existing MoE methods under identical conditions with respect to the number of experts.
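The idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify how the information of unselected experts is encoded or how the generated modules are structured. The sketch assumes, purely for illustration, one learned embedding per expert (`expert_emb`), a linear hypernetwork (`W_hyper_down`, `W_hyper_up`), and a small bottleneck module whose weights are generated from the mean embedding of the unselected experts. All names and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

D, H, E, K = 8, 16, 4, 1   # model dim, expert hidden dim, experts, experts selected per token
Z, R = 6, 2                # expert-embedding dim, bottleneck of generated module (hypothetical)

# Ordinary MoE parameters: a router and E two-layer feed-forward experts.
W_router = rng.normal(0, 0.1, (D, E))
W_in = rng.normal(0, 0.1, (E, D, H))
W_out = rng.normal(0, 0.1, (E, H, D))

# Assumed extras (not taken from the abstract): one embedding per expert and a
# linear hypernetwork mapping a summary of the unselected experts to the
# weights of a small bottleneck module.
expert_emb = rng.normal(0, 0.1, (E, Z))
W_hyper_down = rng.normal(0, 0.1, (Z, D * R))
W_hyper_up = rng.normal(0, 0.1, (Z, R * D))


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def route(tokens):
    """Return (top-K expert indices, renormalised gate weights) per token."""
    probs = softmax(tokens @ W_router)                  # (T, E)
    top = np.argsort(-probs, axis=-1)[:, :K]            # (T, K)
    gates = np.take_along_axis(probs, top, axis=-1)     # (T, K)
    return top, gates / gates.sum(axis=-1, keepdims=True)


def moe_layer(tokens, use_hyper=True):
    """Sparse MoE layer, optionally with a hypernetwork-generated extra module.

    Assumes K < E so that every token has at least one unselected expert.
    """
    top, gates = route(tokens)
    out = np.zeros_like(tokens)
    for t, x in enumerate(tokens):
        # Sparse part: only the K selected experts are evaluated.
        for e, g in zip(top[t], gates[t]):
            out[t] += g * (np.maximum(x @ W_in[e], 0.0) @ W_out[e])
        if use_hyper:
            # Supplementary part: summarise the experts that were NOT selected
            # and generate a small module from that summary.
            unselected = np.setdiff1d(np.arange(E), top[t])
            cond = expert_emb[unselected].mean(axis=0)          # (Z,)
            down = (cond @ W_hyper_down).reshape(D, R)
            up = (cond @ W_hyper_up).reshape(R, D)
            out[t] += np.maximum(x @ down, 0.0) @ up
    return out, top


tokens = rng.normal(size=(5, D))
out, top = moe_layer(tokens)
print(out.shape, top.shape)
```

In this sketch the number of experts evaluated per token stays at `K`. The unselected experts contribute only through their embeddings, which is one possible reading of how their knowledge could be used "while maintaining selection sparsity".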