The dense-to-sparse gating mixture of experts (MoE) has recently emerged as an effective alternative to the well-known sparse MoE. Rather than fixing the number of activated experts, as the sparse MoE does and which could limit the exploration of potential experts, the dense-to-sparse gating MoE uses a temperature to control the softmax weight distribution and the sparsity of the MoE during training, in order to stabilize expert specialization. Nevertheless, while there have been previous attempts to understand the sparse MoE theoretically, a comprehensive analysis of the dense-to-sparse gating MoE has remained elusive. In this paper, we therefore explore the impact of the dense-to-sparse gate on maximum likelihood estimation under the Gaussian MoE. We demonstrate that, due to interactions between the temperature and other model parameters via some partial differential equations, the convergence rates of parameter estimation are slower than any polynomial rate and could be as slow as $\mathcal{O}(1/\log(n))$, where $n$ denotes the sample size. To address this issue, we propose a novel activation dense-to-sparse gate, which routes the output of a linear layer through an activation function before delivering it to the softmax function. By imposing linear independence conditions on the activation function and its derivatives, we show that the parameter estimation rates are significantly improved to polynomial rates. Finally, we conduct a simulation study to empirically validate our theoretical results.
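To make the two gating mechanisms concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The function names, the parameter names (`betas`, `biases`, `temperature`), the choice of `tanh` as the activation, and the exact placement of the bias and temperature inside the gate are all assumptions made for illustration; the abstract only states that a temperature scales the softmax gate and that the activation variant passes the linear layer's output through an activation function before the softmax.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear_scores(x, betas, biases):
    """One linear score per expert: beta_i . x + bias_i."""
    return [sum(b_j * x_j for b_j, x_j in zip(beta, x)) + bias
            for beta, bias in zip(betas, biases)]


def dense_to_sparse_gate(x, betas, biases, temperature):
    """Softmax gate whose linear scores are divided by a temperature.

    A large temperature gives near-uniform (dense) weights; a small
    temperature concentrates the weight on few experts (sparse).
    """
    scores = linear_scores(x, betas, biases)
    return softmax([s / temperature for s in scores])


def activation_dense_to_sparse_gate(x, betas, biases, temperature,
                                    activation=math.tanh):
    """Same gate, but each linear score passes through an activation first.

    Assumption: the activation is applied to the full linear score and the
    result is then divided by the temperature; tanh is a placeholder choice.
    """
    scores = linear_scores(x, betas, biases)
    return softmax([activation(s) / temperature for s in scores])


# Hypothetical toy setup: 3 experts, 2-dimensional input.
x = [1.0, -0.5]
betas = [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]]
biases = [0.0, 0.0, 0.0]

dense_weights = dense_to_sparse_gate(x, betas, biases, temperature=10.0)
sparse_weights = dense_to_sparse_gate(x, betas, biases, temperature=0.1)
activated_weights = activation_dense_to_sparse_gate(
    x, betas, biases, temperature=0.1)
```

In this toy setup, lowering the temperature moves the gate from an almost uniform weighting of the experts toward a weighting dominated by a single expert, which is the dense-to-sparse behavior the abstract describes. The sketch covers only the gate; it does not model the Gaussian experts or the maximum likelihood estimation analyzed in the paper.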