DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y. K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, Wenfeng Liang

In the era of large language models, Mixture-of-Experts (MoE) is a promising architecture for managing computational costs when scaling up model parameters. However, conventional MoE architectures like GShard, which activate the top-$K$ out of $N$ experts, face challenges in ensuring expert specialization, i.e., that each expert acquires non-overlapping and focused knowledge. In response, we propose the DeepSeekMoE architecture towards ultimate expert specialization. It involves two principal strategies: (1) finely segmenting the experts into $mN$ ones and activating $mK$ of them, allowing for a more flexible combination of activated experts; (2) isolating $K_s$ experts as shared ones, aiming at capturing common knowledge and mitigating redundancy in routed experts. Starting from a modest scale with 2B parameters, we demonstrate that DeepSeekMoE 2B achieves performance comparable to GShard 2.9B, which has 1.5 times the expert parameters and computation. In addition, DeepSeekMoE 2B nearly approaches the performance of its dense counterpart with the same number of total parameters, which sets the upper bound for MoE models. Subsequently, we scale up DeepSeekMoE to 16B parameters and show that it achieves performance comparable to LLaMA2 7B while using only about 40% of the computation. Further, our preliminary efforts to scale up DeepSeekMoE to 145B parameters consistently validate its substantial advantages over the GShard architecture, and show performance comparable to DeepSeek 67B while using only 28.5% (possibly even 18.2%) of the computation.
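The two strategies above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `num_expert_combinations` and `select_experts`, the example values ($N=16$, $K=2$, $m=4$), and the toy router scores are all hypothetical. The sketch also assumes that the $K_s$ shared experts count against the $mK$ activation budget, so only $mK - K_s$ routed experts are chosen per token; the abstract does not state this.

```python
from math import comb


def num_expert_combinations(n_experts: int, top_k: int, m: int = 1) -> int:
    """Count the distinct expert sets one token can activate.

    Each of the N experts is split into m finer ones, so a token picks
    m*K experts out of m*N. m=1 is the conventional top-K case.
    """
    return comb(m * n_experts, m * top_k)


def select_experts(routed_scores, m_k: int, k_s: int):
    """Pick the active experts for one token (illustrative only).

    routed_scores: router affinities for the routed experts only.
    m_k:           total number of activated experts (mK).
    k_s:           number of shared experts, which are always active.

    Assumption: shared experts use part of the mK budget, so only
    mK - K_s routed experts are selected by top-k over the scores.
    Expert ids 0..k_s-1 are shared; routed ids start at k_s.
    """
    shared_ids = list(range(k_s))
    n_routed_active = m_k - k_s
    # Rank routed experts by affinity, highest first.
    ranked = sorted(range(len(routed_scores)),
                    key=lambda i: routed_scores[i], reverse=True)
    routed_ids = sorted(k_s + i for i in ranked[:n_routed_active])
    return shared_ids, routed_ids


if __name__ == "__main__":
    # Conventional top-2 of 16 versus the same budget split 4 ways.
    print(num_expert_combinations(16, 2))       # 120
    print(num_expert_combinations(16, 2, m=4))  # 4426165368

    # One shared expert plus the top 3 of 7 routed experts.
    scores = [0.1, 0.9, 0.3, 0.7, 0.2, 0.5, 0.05]
    print(select_experts(scores, m_k=4, k_s=1))  # ([0], [2, 4, 6])
```

Under these example values, the finer segmentation raises the number of possible expert combinations from 120 to 4,426,165,368. The total number of activated expert parameters stays the same, because each fine-grained expert is $1/m$ the size of an original one.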
