Mixture-of-Experts (MoE) has successfully scaled up models while keeping compute costs nearly constant. By employing a gating network to route input tokens, it selectively activates a subset of expert networks to process the corresponding token embeddings. In practice, however, the efficiency of MoE is hard to realize for two key reasons: (i) imbalanced expert activation, which causes substantial idle time and under-utilized capacity under model or expert parallelism; and (ii) massive communication overhead, induced by the large number of expert routing combinations in expert parallelism at the system level. Previous works typically formulate this as a load-imbalance problem, in which the gating network favors certain experts over others, or attribute it to static execution that fails to adapt to dynamic expert workloads at runtime. In this paper, we examine the problem from a new, higher-order perspective on MoE routing policies: expert collaboration and specialization, in which some experts tend to co-activate broadly with many others (collaborative), while others co-activate only with a specific subset of experts (specialized). Our experiments reveal that most experts tend to be overly collaborative, which increases communication overhead by repeatedly sending tokens to different accelerators. To this end, we propose a novel collaboration-constrained routing (C2R) strategy that encourages more specialized expert groups and improves expert utilization, and we present an efficient MoE implementation that further leverages expert specialization. We achieve average performance improvements of 0.51% and 0.33% on LLaMA-MoE and Qwen-MoE respectively across ten downstream NLP benchmarks, and reduce all-to-all communication costs between GPUs, yielding an additional 20%-30% savings in total running time over the existing state of the art, MegaBlocks.
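To make the routing mechanism concrete, below is a minimal PyTorch sketch of standard top-k MoE gating, together with a hypothetical collaboration mask that illustrates the intuition behind collaboration-constrained routing. The `constrained_gating` function, the `expert_groups` mask, and the anchor-expert heuristic are illustrative assumptions for exposition, not the paper's actual C2R implementation.

```python
# Minimal sketch (not the paper's implementation) of top-k MoE gating,
# plus a hypothetical "collaboration mask" restricting which experts may
# be co-activated for the same token. All names here are illustrative.

import torch
import torch.nn.functional as F

def top_k_gating(x, gate_weight, k=2):
    """Standard top-k gating: route each token to its k highest-scoring experts.

    x:           (num_tokens, d_model) token embeddings
    gate_weight: (d_model, num_experts) gating network parameters
    """
    logits = x @ gate_weight                      # (num_tokens, num_experts)
    topk_vals, topk_idx = logits.topk(k, dim=-1)  # pick k experts per token
    gates = F.softmax(topk_vals, dim=-1)          # normalize over chosen experts
    return gates, topk_idx

def constrained_gating(x, gate_weight, expert_groups, k=2):
    """Hypothetical collaboration-constrained variant: after selecting the
    top-1 "anchor" expert, the remaining k-1 experts are drawn only from
    that expert's allowed group, encouraging specialized expert subsets
    (and hence fewer distinct routing combinations across accelerators).

    expert_groups: (num_experts, num_experts) boolean mask; entry (i, j)
                   is True if expert j may be co-activated with expert i.
    """
    logits = x @ gate_weight
    top1 = logits.argmax(dim=-1)                          # anchor expert per token
    allowed = expert_groups[top1]                         # (num_tokens, num_experts)
    masked = logits.masked_fill(~allowed, float("-inf"))  # forbid outside the group
    topk_vals, topk_idx = masked.topk(k, dim=-1)
    gates = F.softmax(topk_vals, dim=-1)
    return gates, topk_idx

# Usage: 8 experts partitioned into two specialized groups of 4.
torch.manual_seed(0)
x = torch.randn(16, 32)                 # 16 tokens, d_model = 32
w = torch.randn(32, 8)                  # gating weights for 8 experts
groups = torch.zeros(8, 8, dtype=torch.bool)
groups[:4, :4] = True                   # experts 0-3 may co-activate together
groups[4:, 4:] = True                   # experts 4-7 may co-activate together
gates, idx = constrained_gating(x, w, groups, k=2)
print(idx)  # both chosen experts for each token fall in the same group
```

With such a grouping, every token's expert set stays within one group, so tokens need to be dispatched to fewer distinct accelerators, which is the system-level effect the abstract attributes to reduced all-to-all communication.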