Sparsely activated Mixture-of-Experts (MoE) models are widely adopted to scale up model capacity without increasing the computation budget. However, vanilla TopK routers are trained in a discontinuous, non-differentiable way, limiting their performance and scalability. To address this issue, we propose ReMoE, a fully differentiable MoE architecture that offers a simple yet effective drop-in replacement for the conventional TopK+Softmax routing, utilizing ReLU as the router instead. We further propose methods to regulate the router's sparsity while balancing the load among experts. ReMoE's continuous nature enables efficient dynamic allocation of computation across tokens and layers, while also exhibiting domain specialization. Our experiments demonstrate that ReMoE consistently outperforms vanilla TopK-routed MoE across various model sizes, expert counts, and levels of granularity. Furthermore, ReMoE exhibits superior scalability with respect to the number of experts, surpassing traditional MoE architectures. The implementation based on Megatron-LM is available at https://github.com/thu-ml/ReMoE.
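The core contrast in the abstract — a discontinuous TopK+Softmax gate versus a continuous ReLU gate — can be illustrated with a minimal sketch. This is not the paper's implementation (which is built on Megatron-LM); the function names are hypothetical, and the ReLU router here omits the sparsity and load-balancing regularization the paper proposes:

```python
import numpy as np

def topk_softmax_router(logits, k):
    # Vanilla MoE routing: softmax over expert logits, then keep only the
    # top-k weights per token and zero the rest. The hard top-k selection
    # step is discrete, so the gate is not differentiable in that choice.
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = exp / exp.sum(axis=-1, keepdims=True)
    mask = np.zeros_like(probs)
    topk_idx = np.argsort(-probs, axis=-1)[..., :k]
    np.put_along_axis(mask, topk_idx, 1.0, axis=-1)
    return probs * mask

def relu_router(logits):
    # ReMoE-style routing (simplified): the gate is ReLU of the logits.
    # Sparsity emerges because negative logits map to exactly zero, and
    # the mapping is continuous (piecewise-linear) in the logits, so the
    # number of active experts can vary per token.
    return np.maximum(logits, 0.0)

# One token, four experts: the ReLU gate activates however many experts
# have positive logits, while TopK always activates exactly k.
logits = np.array([[1.0, -2.0, 0.5, -0.1]])
print(relu_router(logits))            # experts with negative logits are off
print(topk_softmax_router(logits, 2)) # exactly 2 experts kept
```

In the full method, a regularization term on the router keeps the average number of active experts near a target budget, which is what allows the dynamic per-token and per-layer compute allocation mentioned above.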