Mixture-of-Experts (MoE) models have achieved state-of-the-art performance on Neural Machine Translation (NMT) tasks. Existing MoE work mostly adopts a homogeneous design in which the same number of experts, all of the same size, is placed uniformly throughout the network. Furthermore, existing MoE work does not use computational constraints (e.g., FLOPs, latency) to guide the design. To this end, we develop AutoMoE, a framework for designing heterogeneous MoEs under computational constraints. AutoMoE leverages Neural Architecture Search (NAS) to obtain efficient sparse MoE sub-transformers with 4x inference speedup (CPU) and FLOPs reduction over manually designed Transformers, with parity in BLEU score with the dense Transformer and within 1 BLEU point of the MoE SwitchTransformer, in aggregate over benchmark NMT datasets. A heterogeneous search space with dense and sparsely activated Transformer modules (e.g., how many experts? where to place them? what should their sizes be?) allows for adaptive compute, where different amounts of computation are used for different tokens in the input. Adaptivity comes naturally from routing decisions that send tokens to experts of different sizes. AutoMoE code, data, and trained models are available at https://aka.ms/AutoMoE.
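To make the adaptive-compute claim concrete, the following minimal sketch estimates per-token feed-forward FLOPs when tokens are routed to experts of different sizes. It is an illustration under stated assumptions, not the AutoMoE implementation: the names `ExpertSpec`, `expert_flops`, `top1_route`, and `per_token_flops` are hypothetical, top-1 routing is assumed, and the FLOPs count covers only the two linear projections of a standard feed-forward expert.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class ExpertSpec:
    """Hypothetical description of one feed-forward expert."""
    ffn_dim: int  # intermediate (hidden) size of the expert's FFN


def expert_flops(embed_dim: int, expert: ExpertSpec) -> int:
    """Approximate FLOPs for one token through a two-layer FFN expert.

    Assumption: 2 FLOPs per multiply-add, counting only the up- and
    down-projection; biases and activations are ignored.
    """
    up_projection = 2 * embed_dim * expert.ffn_dim
    down_projection = 2 * expert.ffn_dim * embed_dim
    return up_projection + down_projection


def top1_route(router_logits: Sequence[float]) -> int:
    """Return the index of the highest-scoring expert (assumed top-1 routing)."""
    return max(range(len(router_logits)), key=lambda i: router_logits[i])


def per_token_flops(
    embed_dim: int,
    experts: Sequence[ExpertSpec],
    logits_per_token: Sequence[Sequence[float]],
) -> List[int]:
    """FLOPs spent on each token, given its router logits over the experts."""
    return [
        expert_flops(embed_dim, experts[top1_route(logits)])
        for logits in logits_per_token
    ]


# One heterogeneous MoE layer: three experts with different FFN sizes
# (illustrative values, not taken from the paper).
experts = [ExpertSpec(1024), ExpertSpec(2048), ExpertSpec(3072)]

# Made-up router logits for three tokens.
logits = [
    [0.1, 2.0, -1.0],  # routed to expert 1
    [3.0, 0.5, 0.2],   # routed to expert 0
    [0.0, 0.1, 0.9],   # routed to expert 2
]

flops = per_token_flops(512, experts, logits)
print(flops)  # [4194304, 2097152, 6291456]
```

In this toy layer the three tokens consume different amounts of computation solely because the router sends them to experts of different sizes. In a homogeneous layer, every token would cost the same.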