Mixture-of-Experts models (MoEs) can scale beyond traditional deep learning models by employing a routing strategy in which each input is processed by a single "expert" deep learning model. This strategy lets the number of parameters defining the MoE grow while maintaining sparse activation, i.e., for any given input, an MoE loads only a small number of its total parameters into GPU VRAM for the forward pass. In this paper, we provide an approximation-theoretic and learning-theoretic analysis of mixtures of expert MLPs with (P)ReLU activation functions. We first prove that for every error level $\varepsilon>0$ and every Lipschitz function $f:[0,1]^n\to \mathbb{R}$, one can construct a MoMLP model (a Mixture-of-Experts comprising (P)ReLU MLPs) which uniformly approximates $f$ to $\varepsilon$ accuracy over $[0,1]^n$, while only requiring networks of $\mathcal{O}(\varepsilon^{-1})$ parameters to be loaded in memory. Additionally, we show that MoMLPs can generalize, since the entire MoMLP model has a finite VC dimension of $\tilde{\mathcal{O}}(L\max\{nL,JW\})$ when there are $L$ experts, each of depth $J$ and width $W$.
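To make the routing idea concrete, the following is a minimal one-dimensional sketch, not the construction used in the paper's proof. It assumes $n=1$, a hard router that splits $[0,1]$ into equal cells, and experts that are one-hidden-layer ReLU networks interpolating $f$ on their own cell. All names (`make_expert`, `route`, `build_momlp`, `target`) and the example function are hypothetical choices made for illustration. The point it shows is that each forward pass touches the parameters of exactly one expert, while the uniform error is controlled by the mesh size inside each cell.

```python
import math


def relu(z):
    return z if z > 0.0 else 0.0


def target(x):
    # Example Lipschitz function on [0, 1] (Lipschitz constant at most 4).
    # Chosen only for illustration; it does not come from the paper.
    return abs(x - 0.3) + math.sin(3.0 * x)


def make_expert(f, lo, hi, knots):
    """One-hidden-layer ReLU net that linearly interpolates f on [lo, hi].

    The net has the form g(x) = bias + sum_k coeffs[k] * relu(x - thresholds[k]).
    """
    h = (hi - lo) / knots
    ts = [lo + k * h for k in range(knots + 1)]
    ys = [f(t) for t in ts]
    slopes = [(ys[k + 1] - ys[k]) / h for k in range(knots)]
    # Each hidden unit adds the change in slope at its threshold.
    coeffs = [slopes[0]] + [slopes[k] - slopes[k - 1] for k in range(1, knots)]
    return {"bias": ys[0], "thresholds": ts[:knots], "coeffs": coeffs}


def expert_forward(expert, x):
    hidden = (c * relu(x - t) for c, t in zip(expert["coeffs"], expert["thresholds"]))
    return expert["bias"] + sum(hidden)


def route(x, num_experts):
    # Hard router: index of the cell containing x; x = 1.0 goes to the last expert.
    return min(int(x * num_experts), num_experts - 1)


def build_momlp(f, num_experts, knots):
    # Expert i is responsible for the cell [i / L, (i + 1) / L].
    return [
        make_expert(f, i / num_experts, (i + 1) / num_experts, knots)
        for i in range(num_experts)
    ]


def momlp_forward(experts, x):
    # Only the selected expert's parameters are read for this input.
    return expert_forward(experts[route(x, len(experts))], x)


def param_count(expert):
    return 1 + len(expert["thresholds"]) + len(expert["coeffs"])


if __name__ == "__main__":
    experts = build_momlp(target, num_experts=8, knots=50)
    grid = [k / 2000 for k in range(2001)]
    max_err = max(abs(momlp_forward(experts, x) - target(x)) for x in grid)
    print("max error on grid:", max_err)
    print("parameters per expert:", param_count(experts[0]))
    print("parameters in total:", sum(param_count(e) for e in experts))
```

In this sketch, a piecewise-linear interpolant of a function with Lipschitz constant $\mathrm{Lip}$ on a mesh of width $h$ has uniform error at most $\mathrm{Lip}\cdot h$, so adding experts shrinks the number of parameters each forward pass must load for a fixed accuracy, at the cost of a larger total parameter count. The rates stated in the abstract for general $n$ rely on the paper's own construction, which this toy example does not reproduce.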