Mixture-of-Supernets: Improving Weight-Sharing Supernet Training with Architecture-Routed Mixture-of-Experts

Ganesh Jawahar, Haichuan Yang, Yunyang Xiong, Zechun Liu, Dilin Wang, Fei Sun, Meng Li, Aasish Pappu, Barlas Oguz, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan, Raghuraman Krishnamoorthi, Vikas Chandra

Weight-sharing supernets have become a vital component for performance estimation in state-of-the-art (SOTA) neural architecture search (NAS) frameworks. Although a supernet can directly generate different subnetworks without retraining, weight sharing means there is no guarantee of the quality of these subnetworks. In NLP tasks such as machine translation and pre-trained language modeling, we observe that, for the same model architecture, there is a large performance gap between the supernet and training from scratch. Hence, the supernet cannot be used directly, and retraining is necessary after finding the optimal architectures. In this work, we propose mixture-of-supernets, a generalized supernet formulation in which mixture-of-experts (MoE) is adopted to enhance the expressive power of the supernet model with negligible training overhead. In this way, different subnetworks do not share the model weights directly, but through an architecture-based routing mechanism. As a result, the model weights of different subnetworks are customized towards their specific architectures, and the weight generation is learned by gradient descent. Compared to existing weight-sharing supernets for NLP, our method can minimize the retraining time, greatly improving training efficiency. In addition, the proposed method achieves SOTA performance in NAS for building fast machine translation models, yielding a better latency-BLEU tradeoff than HAT, the state-of-the-art NAS framework for MT. We also achieve SOTA performance in NAS for building memory-efficient task-agnostic BERT models, outperforming NAS-BERT and AutoDistil across various model sizes.
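The architecture-based routing idea can be illustrated with a small sketch. The abstract does not specify the router or the expert layout, so everything here is an assumption made for illustration: the helper names (`route`, `subnet_weight`, `encode`), the two-number architecture encoding, the softmax router, and the choice to blend whole expert matrices before slicing. With a single expert, the sketch reduces to plain weight sharing, where a subnetwork simply takes a slice of one shared matrix.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(arch_encoding, router_weights):
    """Map an architecture encoding to mixing coefficients, one per expert.

    Assumption: a single linear layer followed by softmax acts as the router.
    """
    scores = [sum(w * x for w, x in zip(row, arch_encoding))
              for row in router_weights]
    return softmax(scores)


def subnet_weight(experts, alphas, out_dim, in_dim):
    """Blend expert matrices with alphas, then slice to the subnetwork shape."""
    return [
        [sum(a * e[r][c] for a, e in zip(alphas, experts))
         for c in range(in_dim)]
        for r in range(out_dim)
    ]


def linear(weight, x):
    """Apply a weight matrix (list of rows) to an input vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


# Hypothetical sizes: the largest layer in the search space is 4 x 3.
MAX_OUT, MAX_IN, NUM_EXPERTS = 4, 3, 2


def encode(out_dim, in_dim):
    """Hypothetical architecture encoding: dimensions scaled to (0, 1]."""
    return [out_dim / MAX_OUT, in_dim / MAX_IN]


random.seed(0)
experts = [
    [[random.uniform(-1, 1) for _ in range(MAX_IN)] for _ in range(MAX_OUT)]
    for _ in range(NUM_EXPERTS)
]
router_weights = [[random.uniform(-1, 1) for _ in range(2)]
                  for _ in range(NUM_EXPERTS)]

# Two subnetworks with different output widths get different mixtures,
# so their weights are no longer a direct slice of one shared matrix.
alphas_small = route(encode(2, 3), router_weights)
alphas_large = route(encode(4, 3), router_weights)
w_small = subnet_weight(experts, alphas_small, 2, 3)
w_large = subnet_weight(experts, alphas_large, 4, 3)
y_small = linear(w_small, [1.0, 0.5, -0.5])
```

In a trained model the expert matrices and the router would both be updated by gradient descent; the sketch only shows the forward computation of the per-architecture weights.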
