Mixtures of Experts (MoE) are machine learning models that partition the input space and train a separate "expert" model on each partition. MoE models have recently become popular components of large language models as a means of reducing training and inference costs. There, the partitioning function and the experts are learnt jointly via gradient descent on the log-likelihood. In this paper we study the efficiency of the Expectation Maximization (EM) algorithm for training MoE models. We first rigorously analyze EM for mixtures of linear or logistic experts, where we show that EM is equivalent to Mirror Descent with unit step size and a Kullback-Leibler (KL) divergence regularizer. This perspective allows us to derive new convergence results and identify conditions for local linear convergence based on the signal-to-noise ratio (SNR). Experiments on synthetic and (small-scale) real-world data show that EM outperforms gradient descent in both convergence rate and achieved accuracy.
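For concreteness, the EM scheme studied here can be sketched for a two-expert mixture of linear experts with a softmax gate. This is a toy illustration, not the paper's implementation: the synthetic data, initialization, iteration count, and gate step size are all illustrative assumptions, and since the gate's M-step has no closed form, the sketch takes a single gradient ascent step (a generalized EM variant).

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Toy synthetic data: two linear experts selected by a hyperplane ---
n, d = 400, 3
X = rng.normal(size=(n, d))
w_true = np.array([[2.0, -1.0, 0.5],
                   [-1.5, 0.5, 1.0]])       # true expert weights (illustrative)
z = (X[:, 0] + X[:, 1] > 0).astype(int)     # which expert generated each point
y = np.einsum("nd,nd->n", X, w_true[z]) + 0.1 * rng.normal(size=n)

K, sigma2 = 2, 0.1 ** 2
W = rng.normal(size=(K, d))   # expert weights, random init
V = np.zeros((K, d))          # softmax gating weights

def log_likelihood(W, V):
    """Average observed-data log-likelihood (up to an additive constant)."""
    log_gate = X @ V.T
    log_gate -= np.logaddexp.reduce(log_gate, axis=1, keepdims=True)
    log_lik = -0.5 * (y[:, None] - X @ W.T) ** 2 / sigma2
    return np.logaddexp.reduce(log_gate + log_lik, axis=1).mean()

ll = [log_likelihood(W, V)]
for _ in range(50):
    # E-step: responsibilities p(z = k | x, y) under current parameters
    log_gate = X @ V.T
    log_gate -= np.logaddexp.reduce(log_gate, axis=1, keepdims=True)
    log_post = log_gate - 0.5 * (y[:, None] - X @ W.T) ** 2 / sigma2
    log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)
    R = np.exp(log_post)

    # M-step for experts: responsibility-weighted least squares (closed form)
    for k in range(K):
        Xk = X * R[:, k : k + 1]
        W[k] = np.linalg.solve(Xk.T @ X + 1e-8 * np.eye(d), Xk.T @ y)

    # M-step for the gate has no closed form: one gradient ascent step
    V += 0.5 * (R - np.exp(log_gate)).T @ X / n
    ll.append(log_likelihood(W, V))
```

Each iteration alternates a posterior computation (E-step) with closed-form weighted least-squares updates for the experts; the log-likelihood `ll` should increase over the run as EM's monotonicity guarantee suggests.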