By routing each input token to only a few split experts, Sparse Mixture-of-Experts has enabled efficient training of large language models. Recent findings suggest that freezing the routers at their random initialization can achieve competitive performance by alleviating the collapsing problem, where all experts eventually learn similar representations. However, this strategy has two key limitations: (i) the policy derived from random routers might be sub-optimal, and (ii) it requires extensive resources during training and evaluation, leading to limited efficiency gains. This work introduces \HyperRouter, which dynamically generates the router's parameters through a fixed hypernetwork and trainable embeddings. This design strikes a balance between training the routers and freezing them, and thereby learns an improved routing policy. Extensive experiments across a wide range of tasks demonstrate the superior performance and efficiency gains of \HyperRouter compared to existing routing methods. Our implementation is publicly available at \url{https://github.com/giangdip2410/HyperRouter}.
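To make the idea of generating router parameters from a fixed hypernetwork and trainable embeddings concrete, the following is a minimal NumPy sketch rather than the released implementation. It assumes a single frozen linear map as the hypernetwork, one embedding vector per router, and top-k softmax gating. None of these choices is specified in the abstract, and all names and dimensions (`W_hyper`, `router_embedding`, `generate_router`, `route`, `d_embed`, `top_k`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; none of these values come from the paper.
d_model, n_experts, d_embed, top_k = 16, 4, 8, 2

# Frozen hypernetwork: randomly initialised and never updated.
# Assumption: a single linear map; the abstract does not give its architecture.
W_hyper = rng.standard_normal((d_embed, d_model * n_experts)) / np.sqrt(d_embed)

# Trainable embedding: in this sketch, the only routing parameter
# an optimiser would update.
router_embedding = rng.standard_normal(d_embed)


def generate_router(embedding, hyper_weights):
    """Map the embedding through the frozen hypernetwork to router weights."""
    return (embedding @ hyper_weights).reshape(d_model, n_experts)


def route(tokens, router_weights, k):
    """Return the top-k expert indices and normalised gate values per token."""
    logits = tokens @ router_weights                    # (n_tokens, n_experts)
    top_idx = np.argsort(-logits, axis=-1)[:, :k]       # k highest-scoring experts
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    top_logits = top_logits - top_logits.max(axis=-1, keepdims=True)
    gates = np.exp(top_logits)
    gates /= gates.sum(axis=-1, keepdims=True)          # softmax over selected experts
    return top_idx, gates


tokens = rng.standard_normal((5, d_model))
router_weights = generate_router(router_embedding, W_hyper)
expert_idx, gate_values = route(tokens, router_weights, top_k)
```

In this sketch the router is neither fully trained nor fully frozen: gradient updates would reach the routing policy only through `router_embedding`, while `W_hyper` stays at its random initialization.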