Mixture of Experts (MoE) has proven effective at increasing the capacity of language models by dynamically routing each input token to a specific subset of experts for processing. Despite this success, most existing methods face a trade-off between sparsity and the availability of expert knowledge: improving performance by drawing on more expert knowledge usually reduces sparsity in expert selection. To mitigate this tension, we propose HyperMoE, a novel MoE framework built upon Hypernetworks. The framework integrates the computation of MoE with the concept of knowledge transfer from multi-task learning. Specific modules generated from the information of unselected experts serve as supplementary information, which allows the knowledge of unselected experts to be used while maintaining selection sparsity. Comprehensive empirical evaluations across multiple datasets and backbones show that HyperMoE significantly outperforms existing MoE methods under identical conditions with respect to the number of experts.
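The routing idea described above can be illustrated with a toy forward pass. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify the architecture, so the class name `ToyHyperMoE`, the per-expert embeddings, the mean-pooled summary of unselected experts, the two-layer bottleneck adapter, and all dimensions are hypothetical choices made only to show how a hypernetwork could turn information about unselected experts into a supplementary module while the expert selection itself stays top-k sparse.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix standing in for learned weights."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def reshape(flat, rows, cols):
    """Split a flat list into `rows` lists of length `cols`."""
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


class ToyHyperMoE:
    """Hypothetical toy layer: top-k MoE plus a hypernetwork-generated adapter."""

    def __init__(self, d_model=8, n_experts=4, top_k=1,
                 d_embed=4, d_bottleneck=2, seed=0):
        rng = random.Random(seed)
        self.d_model, self.n_experts, self.top_k = d_model, n_experts, top_k
        self.d_embed, self.d_bottleneck = d_embed, d_bottleneck
        self.router = rand_matrix(n_experts, d_model, rng)
        # Assumption: each expert is a single linear map (a simplification).
        self.experts = [rand_matrix(d_model, d_model, rng)
                        for _ in range(n_experts)]
        # Assumption: one embedding vector describes each expert.
        self.expert_embed = rand_matrix(n_experts, d_embed, rng)
        # Hypernetwork: maps a summary embedding to flattened adapter weights.
        self.hyper_down = rand_matrix(d_bottleneck * d_model, d_embed, rng)
        self.hyper_up = rand_matrix(d_model * d_bottleneck, d_embed, rng)

    def route(self, x):
        """Return gate probabilities and the selected / unselected expert ids."""
        probs = softmax(matvec(self.router, x))
        ranked = sorted(range(self.n_experts),
                        key=lambda i: probs[i], reverse=True)
        return probs, ranked[:self.top_k], ranked[self.top_k:]

    def forward(self, x):
        probs, selected, unselected = self.route(x)

        # Standard sparse MoE: only the selected experts are evaluated.
        norm = sum(probs[i] for i in selected)
        out = [0.0] * self.d_model
        for i in selected:
            y = matvec(self.experts[i], x)
            out = [o + (probs[i] / norm) * v for o, v in zip(out, y)]

        # Supplementary path: summarize the unselected experts (assumed here
        # to be the mean of their embeddings) and let the hypernetwork
        # generate a small adapter from that summary.
        if unselected:
            summary = [
                sum(self.expert_embed[i][j] for i in unselected) / len(unselected)
                for j in range(self.d_embed)
            ]
            down = reshape(matvec(self.hyper_down, summary),
                           self.d_bottleneck, self.d_model)
            up = reshape(matvec(self.hyper_up, summary),
                         self.d_model, self.d_bottleneck)
            hidden = [max(0.0, v) for v in matvec(down, x)]  # ReLU
            extra = matvec(up, hidden)
            out = [o + e for o, e in zip(out, extra)]

        return out, selected
```

Calling `ToyHyperMoE().forward(x)` on a vector `x` of length 8 returns the layer output together with the indices of the experts that were actually evaluated. In this sketch the extra cost of the supplementary path is a few small matrix products whose size depends on `d_bottleneck` and `d_embed`, not on the number of unselected experts, which is one plausible way to keep selection sparse while still conditioning on experts that were not chosen.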