COMET: Learning Cardinality Constrained Mixture of Experts with Trees and Local Search

The sparse Mixture-of-Experts (Sparse-MoE) framework efficiently scales up model capacity in various domains, such as natural language processing and vision. Sparse-MoEs select a subset of the "experts" (and thus only a portion of the overall network) for each input sample using a sparse, trainable gate. Existing sparse gates are prone to convergence and performance issues when trained with first-order optimization methods. In this paper, we introduce two improvements to current MoE approaches. First, we propose a new sparse gate, COMET, which relies on a novel tree-based mechanism. COMET is differentiable, can exploit sparsity to speed up computation, and outperforms state-of-the-art gates. Second, because sparse expert selection is a challenging combinatorial problem, first-order methods are typically prone to low-quality solutions. To deal with this challenge, we propose a novel, permutation-based local search method that can complement first-order methods in training any sparse gate, e.g., Hash routing, Top-k, DSelect-k, and COMET. We show that local search can help networks escape bad initializations or solutions. We perform large-scale experiments on various domains, including recommender systems, vision, and natural language processing. On standard vision and recommender systems benchmarks, COMET+ (COMET with local search) achieves up to 13% improvement in ROC AUC over popular gates, e.g., Hash routing and Top-k, and up to 9% over prior differentiable gates, e.g., DSelect-k. When Top-k and Hash gates are combined with local search, we see up to a $100\times$ reduction in the budget needed for hyperparameter tuning. Moreover, for language modeling, our approach improves over the state-of-the-art MoEBERT model for distilling BERT on 5/7 GLUE benchmarks as well as on the SQuAD dataset.
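
To make the notion of a sparse gate concrete, the following is a minimal sketch of a generic Top-k gate, one of the baseline gates named above. It is not the COMET gate and not the authors' implementation. The function names `top_k_gate` and `sparse_moe`, the toy scalar experts, and the fixed gate logits are all hypothetical choices made for illustration; in a real model the logits would come from a trainable layer applied to the input.

```python
import math


def top_k_gate(logits, k):
    """Sparse gate weights: softmax over the k largest logits, zero elsewhere."""
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest logits (ties broken in favor of the lower index).
    top = sorted(range(len(logits)), key=lambda i: (-logits[i], i))[:k]
    m = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    return [exps[i] / z if i in exps else 0.0 for i in range(len(logits))]


def sparse_moe(x, experts, gate_logits, k):
    """Weighted combination of the selected experts' outputs."""
    weights = top_k_gate(gate_logits, k)
    # Experts with zero weight are never evaluated; this is the source of
    # the computational saving in a sparse MoE.
    return sum(w * experts[i](x) for i, w in enumerate(weights) if w > 0.0)


# Hypothetical toy experts (assumption: scalar functions stand in for sub-networks).
experts = [lambda x: x, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
gate_logits = [0.1, 2.0, 2.0, -1.0]  # assumption: fixed here, learned in practice

weights = top_k_gate(gate_logits, k=2)
output = sparse_moe(3.0, experts, gate_logits, k=2)
print(weights, output)
```

Only two of the four experts are evaluated for this input. The hard selection of the top k indices is what makes the gate sparse, and it is also why such gates can be difficult to train with first-order methods alone.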
