The mixture of experts (MoE) model is a statistical machine learning architecture that aggregates multiple expert networks through a softmax gating function to form a more intricate and expressive model. Although MoE models are widely used across applications owing to their scalability, their mathematical and statistical properties are complex and difficult to analyze. As a result, prior theoretical work has primarily focused on probabilistic MoE models, imposing the impractical assumption that the data are generated from a Gaussian MoE model. In this work, we investigate the performance of the least squares estimator (LSE) under a deterministic MoE model in which the data are sampled according to a regression model, a setting that has remained largely unexplored. We establish a condition called strong identifiability to characterize the convergence behavior of various types of expert functions. We demonstrate that the rates for estimating strongly identifiable experts, namely the widely used feed-forward networks with activation functions $\mathrm{sigmoid}(\cdot)$ and $\tanh(\cdot)$, are substantially faster than those of polynomial experts, which we show to exhibit a surprisingly slow estimation rate. Our findings have important practical implications for expert selection.
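To fix ideas, the following is a minimal sketch of the deterministic setting described above; the symbols $k$, $(\beta_{0i}, \beta_{i})$, $\eta_{i}$, $h$, and $n$ are illustrative notation and are not prescribed by this abstract. The responses are assumed to follow the regression model
$$Y_{j} = f_{G_{*}}(X_{j}) + \varepsilon_{j}, \qquad f_{G}(x) = \sum_{i=1}^{k} \frac{\exp\big(\beta_{i}^{\top} x + \beta_{0i}\big)}{\sum_{\ell=1}^{k} \exp\big(\beta_{\ell}^{\top} x + \beta_{0\ell}\big)} \, h(x, \eta_{i}),$$
where the softmax weights gate the experts, $h(\cdot, \eta_{i})$ is the $i$-th expert function (e.g., a feed-forward network with $\mathrm{sigmoid}(\cdot)$ or $\tanh(\cdot)$ activation, or a polynomial), and $G_{*}$ denotes the true mixing measure. The LSE is then any minimizer of the empirical squared error $\sum_{j=1}^{n} \big(Y_{j} - f_{G}(X_{j})\big)^{2}$ over candidate mixing measures $G$.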