The cosine router in Mixture of Experts (MoE) has recently emerged as an attractive alternative to the conventional linear router. Indeed, the cosine router demonstrates favorable performance on image and language tasks and exhibits a better ability to mitigate the representation collapse issue, which often leads to parameter redundancy and limited representational capacity. Despite its empirical success, a comprehensive analysis of the cosine router in MoE has been lacking. Considering the least squares estimation of the cosine routing MoE, we demonstrate that, due to the intrinsic interaction of the model parameters in the cosine router via certain partial differential equations, the estimation rates of the experts and model parameters can be as slow as $\mathcal{O}(1/\log^{\tau}(n))$, where $\tau > 0$ is some constant and $n$ is the sample size, regardless of the structures of the experts. Surprisingly, these pessimistic non-polynomial convergence rates can be circumvented by a technique widely used in practice to stabilize the cosine router: simply adding noise to the $L^2$ norms in the cosine router, which we refer to as the \textit{perturbed cosine router}. Under strongly identifiable settings of the expert functions, we prove that the estimation rates for both the experts and model parameters under the perturbed cosine routing MoE improve significantly to polynomial rates. Finally, we conduct extensive simulation studies in both synthetic and real data settings to empirically validate our theoretical results.
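To make the routing mechanism concrete, the following is a minimal NumPy sketch of a perturbed cosine router. It assumes the common formulation in which routing scores are cosine similarities between inputs and learned expert embeddings, scaled by a temperature, and the perturbation takes the form of adding a small constant \texttt{eps} to each $L^2$ norm in the denominator; the function name, temperature value, and perturbation form are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax along the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def perturbed_cosine_router(x, W, tau=0.07, eps=1e-2):
    """Routing weights for inputs x of shape (n, d) over k experts with
    embeddings W of shape (k, d). The constant eps > 0 perturbs the L2
    norms in the denominator; eps = 0 recovers the plain cosine router."""
    num = x @ W.T  # (n, k) inner products <x_i, w_j>
    denom = (np.linalg.norm(x, axis=1, keepdims=True) + eps) * \
            (np.linalg.norm(W, axis=1) + eps)  # perturbed norm products
    return softmax(num / (denom * tau))  # rows sum to 1 across experts
```

Each row of the output is a probability vector over the $k$ experts; the MoE output is then the corresponding convex combination of the expert outputs.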