Mixture-of-Experts (MoE) has emerged as a practical approach to scaling up Transformer parameters for better generalization while keeping the growth in computation sub-linear. Current MoE models are mainly built with expert parallelism on distributed devices. However, this approach typically requires homogeneous devices for deployment and suffers from heavy communication overhead and computation redundancy. In this paper, we explore developing a \texttt{H}eterogeneous-aware \texttt{EX}pert \texttt{A}llocation framework, \textbf{\texttt{HEXA-MoE}}, with significantly enhanced computing efficiency. It contains two components: ($1$) \textit{Expert-Specific Operators}. We replace the typical general matrix multiplication (GEMM) or grouped GEMM interfaces with our operators, which allow computation to be performed in place with \textbf{ZERO} redundancy. ($2$) \textit{Adaptive Data- and Model-Centric Configurations} for different workload scales. Specifically, we introduce a pipeline-shared cache on each device to address the heavy memory consumption of existing data-centric MoE libraries. Comprehensive experiments on the Swin-MoE benchmark consistently demonstrate the effectiveness of our \texttt{HEXA-MoE} framework, i.e., reducing memory consumption by $10\%\sim48\%$ and achieving a $0.5\sim4.3\times$ speedup over current state-of-the-art MoE libraries. Furthermore, we evaluate \texttt{HEXA-MoE} on heterogeneous devices under both data- and model-centric settings. The promising results show that employing the optimal parallel configuration with \texttt{HEXA-MoE} on heterogeneous devices can substantially reduce overall latency. Code is available at https://github.com/UNITES-Lab/HEXA-MoE.
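The abstract contrasts grouped-GEMM-style expert dispatch with in-place, expert-specific computation, but does not spell out the mechanics. The NumPy toy below is a minimal sketch of that contrast only, not the paper's actual operators: it assumes top-1 routing and dense weights, and all variable names are illustrative.

```python
import numpy as np

# Toy MoE layer: T tokens, d_in -> d_out, E experts, top-1 routing (assumed).
rng = np.random.default_rng(0)
T, d_in, d_out, E = 8, 4, 4, 2
x = rng.standard_normal((T, d_in))
W = rng.standard_normal((E, d_in, d_out))   # one weight matrix per expert
route = rng.integers(0, E, size=T)          # expert id assigned to each token

# Conventional grouped-GEMM style: gather each expert's tokens into a
# contiguous buffer, multiply, then scatter results back. The gather and
# scatter create extra buffers and data movement.
y_grouped = np.empty((T, d_out))
for e in range(E):
    idx = np.nonzero(route == e)[0]
    gathered = x[idx]                        # gather (copy)
    y_grouped[idx] = gathered @ W[e]         # scatter back

# Sketch of the in-place idea: each token's output is written directly at
# its own position using its expert's weights, with no gather/scatter
# buffers. (The paper's operators fuse this on-device; this loop only
# illustrates the data layout.)
y_inplace = np.empty((T, d_out))
for t in range(T):
    y_inplace[t] = x[t] @ W[route[t]]

assert np.allclose(y_grouped, y_inplace)    # same result, different movement
```

The two paths compute identical outputs; the difference lies in the intermediate buffers and copies, which is what the in-place, expert-specific formulation eliminates.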