Mixture-of-experts (MoE) methods are a key component of most large language model architectures, including the recent series of DeepSeek models. Compared to other MoE implementations, DeepSeekMoE stands out for two distinctive features: the deployment of a shared expert strategy and a normalized sigmoid gating mechanism. Despite the prominent role of DeepSeekMoE in the success of the DeepSeek series of models, there have been only a few attempts to theoretically justify the value of the shared expert strategy, while its normalized sigmoid gating has remained unexplored. To bridge this gap, we undertake a comprehensive theoretical study of these two features of DeepSeekMoE from a statistical perspective. We perform a convergence analysis of the expert estimation task to highlight the gains in sample efficiency afforded by both the shared expert strategy and the normalized sigmoid gating, offering useful insights into the design of expert and gating structures. To empirically verify our theoretical findings, we carry out several experiments on both synthetic data and real-world datasets for (vision) language modeling tasks. Finally, we conduct an extensive empirical analysis of router behavior, covering router saturation, router change rate, and expert utilization.
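For concreteness, the following is a minimal PyTorch sketch (not the paper's or DeepSeek's implementation) of the two mechanisms discussed above: routed experts selected through normalized sigmoid gating, plus shared experts that process every token unconditionally. All layer sizes, the number of experts, and the top-k value are illustrative assumptions.

```python
# Minimal sketch of a DeepSeekMoE-style layer: sigmoid gate scores are computed per
# routed expert, the top-k scores are kept and renormalized to sum to one, and a set
# of shared experts is applied to every token without gating. Sizes are illustrative.
import torch
import torch.nn as nn


class DeepSeekMoESketch(nn.Module):
    def __init__(self, d_model=64, d_hidden=128, n_routed=8, n_shared=2, top_k=2):
        super().__init__()

        def make_expert():
            return nn.Sequential(
                nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
            )

        self.routed = nn.ModuleList(make_expert() for _ in range(n_routed))
        self.shared = nn.ModuleList(make_expert() for _ in range(n_shared))
        self.router = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x):  # x: (batch, d_model)
        # Normalized sigmoid gating: per-expert sigmoid scores, keep the top-k,
        # then renormalize the kept scores so they sum to one for each token.
        scores = torch.sigmoid(self.router(x))                 # (batch, n_routed)
        top_vals, top_idx = scores.topk(self.top_k, dim=-1)    # (batch, top_k)
        weights = top_vals / top_vals.sum(dim=-1, keepdim=True)

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx = top_idx[:, slot]                              # chosen expert per token
            w = weights[:, slot].unsqueeze(-1)
            for e, expert in enumerate(self.routed):
                mask = idx == e
                if mask.any():
                    out[mask] += w[mask] * expert(x[mask])

        # Shared expert strategy: these experts see every token, with no gating.
        for expert in self.shared:
            out = out + expert(x)
        return out


if __name__ == "__main__":
    layer = DeepSeekMoESketch()
    tokens = torch.randn(4, 64)
    print(layer(tokens).shape)  # torch.Size([4, 64])
```

Note the design choice highlighted by the sketch: unlike softmax gating, the sigmoid scores are not coupled across experts, and normalization is applied only over the selected top-k experts.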