The sigmoid gate in mixture-of-experts (MoE) models has been empirically shown to outperform the softmax gate across several tasks, ranging from approximating feed-forward networks to language modeling. Recent work has further demonstrated that the sigmoid gate is provably more sample-efficient than its softmax counterpart in regression settings. Nevertheless, three notable concerns remain unaddressed in the literature: (i) the benefits of the sigmoid gate have not been established in classification settings; (ii) existing sigmoid-gated MoE models may fail to converge to the ground-truth parameters; and (iii) the effects of a temperature parameter in the sigmoid gate remain theoretically underexplored. To tackle these open problems, we perform a comprehensive analysis of multinomial logistic MoE models equipped with a modified sigmoid gate that guarantees model convergence. Our results indicate that the sigmoid gate attains a lower sample complexity than the softmax gate for both parameter and expert estimation. Furthermore, we find that incorporating a temperature into the sigmoid gate leads to a sample complexity of exponential order, owing to an intrinsic interaction between the temperature and the gating parameters. To overcome this issue, we propose replacing the vanilla inner-product score in the gating function with a Euclidean score that removes this interaction, thereby substantially improving the sample complexity to a polynomial order.
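To make the two gating scores concrete, the following is a minimal NumPy sketch of a temperature-scaled sigmoid gate using (a) the vanilla inner-product score and (b) a Euclidean (negative squared distance) score. The function names, the bias term, and the exact placement of the temperature are illustrative assumptions, not the paper's precise parameterization.

```python
import numpy as np

def sigmoid(z):
    """Elementwise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_gate_inner(x, W, b, tau=1.0):
    # Vanilla inner-product score <w_k, x> + b_k, scaled by temperature tau
    # (hypothetical parameterization for illustration).
    scores = (W @ x + b) / tau
    return sigmoid(scores)

def sigmoid_gate_euclidean(x, W, b, tau=1.0):
    # Euclidean score -||x - w_k||^2 + b_k, scaled by temperature tau.
    # The abstract argues this form removes the problematic interaction
    # between the temperature and the gating parameters.
    scores = (-np.sum((W - x) ** 2, axis=1) + b) / tau
    return sigmoid(scores)

rng = np.random.default_rng(0)
x = rng.normal(size=4)          # a single input
W = rng.normal(size=(3, 4))     # one gating vector per expert (3 experts)
b = np.zeros(3)                 # gating biases

g_inner = sigmoid_gate_inner(x, W, b, tau=0.5)
g_euc = sigmoid_gate_euclidean(x, W, b, tau=0.5)
# Each expert's gate value lies in (0, 1); unlike the softmax gate,
# the sigmoid gate values need not sum to 1 across experts.
```

Note the key structural difference: under the inner-product score, rescaling the temperature can be absorbed by rescaling the gating vectors, whereas the quadratic term in the Euclidean score blocks this reparameterization.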