Sparse models, including sparse Mixture-of-Experts (MoE) models, have emerged as an effective approach for scaling Transformer models. However, they often suffer from computational inefficiency, since many parameters are unnecessarily involved in computation through multiplication by zero or low activation values. To address this issue, we present \tool, a novel MoE design that enhances both the efficacy and efficiency of sparse MoE models. \tool leverages small experts and a threshold-based router to enable tokens to selectively engage only the essential parameters. Our extensive experiments on language modeling and machine translation tasks demonstrate that \tool improves model performance while reducing the computational load at MoE layers by over 50\%. Furthermore, we demonstrate the versatility of \tool by applying it to dense models, enabling sparse computation during inference. We provide a comprehensive analysis and make our code available at https://github.com/ysngki/XMoE.
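The threshold-based routing idea can be sketched as follows. This is a minimal illustration under our own assumptions, not the paper's implementation: the threshold value, the cap on experts per token, and all function names are hypothetical. A token activates experts in descending order of gate probability only until their cumulative probability reaches a threshold, so experts with near-zero scores are skipped entirely.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def threshold_route(gate_logits, threshold=0.9, max_experts=4):
    """Pick the smallest set of experts (by descending gate probability)
    whose cumulative probability reaches `threshold`, capped at
    `max_experts`. Threshold and cap are illustrative choices."""
    probs = softmax(gate_logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, mass = [], 0.0
    for idx in order[:max_experts]:
        chosen.append(idx)
        mass += probs[idx]
        if mass >= threshold:
            break
    return chosen, probs

# One token's router logits over 8 hypothetical small experts
logits = [2.0, 0.1, 1.5, -0.5, 0.0, 0.3, -1.0, 0.2]
experts, probs = threshold_route(logits)
# `experts` holds only the few experts this token actually engages
```

Because the number of selected experts adapts per token rather than being a fixed top-k, tokens with a confidently peaked gate distribution touch fewer parameters, which is where the computational savings come from.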