Machine learning models that aggregate the outputs of submodels, at either the activation level or the prediction level, often exhibit strong performance compared to individual models. We study the interplay of two popular classes of such models: ensembles of neural networks and sparse mixtures of experts (sparse MoEs). First, we show that the two approaches have complementary features whose combination is beneficial. This includes a comprehensive evaluation of sparse MoEs on uncertainty-related benchmarks. Then, we present Efficient Ensemble of Experts (E$^3$), a scalable and simple ensemble of sparse MoEs that takes the best of both classes of models while using up to 45% fewer FLOPs than a deep ensemble. Extensive experiments demonstrate that E$^3$ improves accuracy, log-likelihood, few-shot learning, robustness, and uncertainty estimation over several challenging vision Transformer-based baselines. E$^3$ not only preserves its efficiency while scaling to models with up to 2.7B parameters, but also provides better predictive performance and uncertainty estimates for larger models.
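To make the distinction between the two aggregation levels concrete, here is a minimal sketch in plain Python. It is not the E$^3$ method and does not reproduce any result from the paper; it only illustrates the generic ideas of prediction-level averaging (as in an ensemble) and activation-level top-k routing (as in a sparse MoE layer). The function names `ensemble_predict` and `sparse_moe_layer`, the toy experts, and the choice to renormalize the gate weights over the selected experts are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(member_logits):
    """Prediction-level aggregation (hypothetical helper).

    Each ensemble member produces its own class logits; the ensemble
    prediction is the mean of the members' class probabilities.
    """
    probs = [softmax(logits) for logits in member_logits]
    n_members = len(probs)
    n_classes = len(probs[0])
    return [sum(p[c] for p in probs) / n_members for c in range(n_classes)]


def sparse_moe_layer(x, experts, gate_logits, k=1):
    """Activation-level aggregation (hypothetical helper).

    Only the k experts with the largest gate weights are evaluated, and
    their outputs are combined with gate weights renormalized over the
    selected experts (an assumption of this sketch).
    """
    gates = softmax(gate_logits)
    top = sorted(range(len(experts)), key=lambda i: gates[i], reverse=True)[:k]
    norm = sum(gates[i] for i in top)
    out = [0.0] * len(x)
    for i in top:
        y = experts[i](x)  # the remaining experts are never called
        w = gates[i] / norm
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, top


# Two members that disagree symmetrically yield a uniform averaged prediction.
avg_probs = ensemble_predict([[2.0, 0.0], [0.0, 2.0]])

# Three toy experts; with k=1 only the highest-gated expert (index 1) runs.
toy_experts = [
    lambda v: [2.0 * a for a in v],
    lambda v: [a + 1.0 for a in v],
    lambda v: [-a for a in v],
]
moe_out, chosen = sparse_moe_layer(
    [1.0, 2.0], toy_experts, gate_logits=[0.1, 3.0, -1.0], k=1
)
```

In this sketch the ensemble evaluates every member for every input, whereas the sparse layer evaluates only `k` of its experts, which is the general source of the compute savings that sparse routing offers. An ensemble of sparse MoEs would combine both levels, averaging the predictions of several models that each contain sparsely routed layers.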