Mixture of experts (MoE) methods are a key component in most large language model architectures, including the recent series of DeepSeek models. Compared to other MoE implementations, DeepSeekMoE stands out because of two unique features: the deployment of a shared expert strategy and a normalized sigmoid gating mechanism. Despite the prominent role of DeepSeekMoE in the success of the DeepSeek series of models, there have been only a few attempts to theoretically justify the value of the shared expert strategy, while its normalized sigmoid gating has remained unexplored. To bridge this gap, we undertake a comprehensive theoretical study of these two features of DeepSeekMoE from a statistical perspective. We perform a convergence analysis of the expert estimation task to highlight the gains in sample efficiency of both the shared expert strategy and the normalized sigmoid gating, offering useful insights into the design of expert and gating structures. To verify our theoretical findings empirically, we carry out several experiments on both synthetic data and real-world datasets for (vision) language modeling tasks. Finally, we conduct an extensive empirical analysis of router behaviors, ranging from router saturation and router change rate to expert utilization.
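The two mechanisms named above can be illustrated with a minimal sketch: routed experts receive sigmoid affinity scores that are normalized only over the selected top-k experts, while shared experts are applied to every token unconditionally. This is an illustrative NumPy sketch under assumed shapes and function names, not the paper's implementation:

```python
import numpy as np

def normalized_sigmoid_gating(logits, k):
    """Sigmoid scores per routed expert, normalized over the top-k selection."""
    s = 1.0 / (1.0 + np.exp(-logits))      # elementwise sigmoid affinities
    topk = np.argsort(s)[-k:]              # indices of the k largest scores
    gates = np.zeros_like(s)
    gates[topk] = s[topk] / s[topk].sum()  # renormalize over selected experts only
    return gates

def moe_layer(x, shared_experts, routed_experts, router_w, k):
    """Shared experts always fire; routed experts are weighted by the gates."""
    out = sum(e(x) for e in shared_experts)        # no gating on shared experts
    gates = normalized_sigmoid_gating(router_w @ x, k)
    for i, expert in enumerate(routed_experts):
        if gates[i] > 0:                           # only top-k experts compute
            out = out + gates[i] * expert(x)
    return out
```

Because the selected gates sum to one, the routed branch is a convex combination of the chosen experts' outputs, which is the property the normalized sigmoid gating analysis in the abstract refers to.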