Originally introduced as a neural network for ensemble learning, mixture of experts (MoE) has recently become a fundamental building block of highly successful modern deep neural networks for heterogeneous data analysis in several applications of machine learning and statistics. Despite its popularity in practice, the theoretical understanding of the MoE model remains far from complete. To shed new light on this problem, we provide a convergence analysis for maximum likelihood estimation (MLE) in the Gaussian-gated MoE model. The main challenge of this analysis comes from the inclusion of covariates in both the Gaussian gating functions and the expert networks, which leads to an intrinsic interaction between them via some partial differential equations with respect to their parameters. We tackle these issues by designing novel Voronoi loss functions among parameters that accurately capture the heterogeneity of parameter estimation rates. Our findings reveal that the MLE behaves differently under two complementary settings of the location parameters of the Gaussian gating functions, namely when all these parameters are non-zero versus when at least one of them vanishes. Notably, these behaviors can be characterized by the solvability of two different systems of polynomial equations. Finally, we conduct a simulation study to empirically verify our theoretical results.
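To make the object of study concrete, the following is a minimal sketch of a Gaussian-gated MoE conditional density for a scalar covariate and a scalar response. The abstract does not spell out the parameterisation, so the form used here is an assumption: gating weights proportional to pi_i * N(x | c_i, gamma_i), and Gaussian experts with linear means a_i * x + b_i. All function names and parameter values are hypothetical and chosen only for illustration. Note that the covariate x enters both the gate and the expert mean, which is the source of the interaction described above.

```python
import numpy as np


def normal_pdf(z, mean, var):
    """Univariate Gaussian density N(z | mean, var)."""
    return np.exp(-0.5 * (z - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


def gating_weights(x, pi, c, gamma):
    """Gaussian gate: weight_i(x) is proportional to pi_i * N(x | c_i, gamma_i)."""
    scores = pi * normal_pdf(x, c, gamma)
    return scores / scores.sum()


def conditional_density(y, x, pi, c, gamma, a, b, sigma2):
    """p(y | x) = sum_i weight_i(x) * N(y | a_i * x + b_i, sigma2_i)."""
    weights = gating_weights(x, pi, c, gamma)
    return float(np.sum(weights * normal_pdf(y, a * x + b, sigma2)))


# Hypothetical two-expert parameters (assumed values, not from the paper).
pi = np.array([0.4, 0.6])        # mixing proportions
c = np.array([-1.0, 1.0])        # gating location parameters (all non-zero here)
gamma = np.array([1.0, 0.5])     # gating variances
a = np.array([2.0, -1.0])        # expert slopes
b = np.array([0.0, 1.0])         # expert intercepts
sigma2 = np.array([0.25, 0.25])  # expert noise variances

density_at_point = conditional_density(0.5, 0.3, pi, c, gamma, a, b, sigma2)
```

Setting one entry of `c` to zero gives an instance of the second regime mentioned above, in which at least one gating location parameter vanishes.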