Mixture-of-Experts (MoE) has emerged as a practical approach to scaling up Transformer parameters for better generalization while keeping the growth in computation overhead sub-linear. Current MoE models are mainly built with expert parallelism on distributed devices. However, this paradigm usually requires homogeneous devices for deployment and suffers from heavy communication overhead and computation redundancy. In this paper, we develop a \texttt{H}eterogeneous-aware \texttt{EX}pert \texttt{A}llocation framework, \textbf{\texttt{HEXA-MoE}}, with significantly enhanced computing efficiency. It contains two components: ($1$) \textit{Expert-Specific Operators}. We replace the typical general matrix multiplication or grouped matrix multiplication interfaces with our operators, which allow computation to be performed in-place with \textbf{ZERO} redundancy. ($2$) \textit{Adaptive Data- and Model-Centric Configurations} for different workload scales. Specifically, we introduce a pipeline-shared cache on each device to tackle the heavy memory consumption of the existing data-centric MoE library. Comprehensive experiments on the Swin-MoE benchmark consistently demonstrate the effectiveness of our \texttt{HEXA-MoE} framework, \textit{i.e.}, reducing memory consumption by $10\%\sim48\%$ and achieving a $0.5\sim4.3\times$ speedup compared to current state-of-the-art MoE libraries. Furthermore, we evaluate \texttt{HEXA-MoE} on heterogeneous devices under both data- and model-centric settings. Promising results show that employing the optimal parallel configuration with \texttt{HEXA-MoE} on heterogeneous devices can substantially reduce overall latency. Code is available at \href{https://github.com/UNITES-Lab/HEXA-MoE}{\underline{here}}.
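The redundancy that expert-specific operators eliminate can be illustrated with a minimal sketch (not the paper's actual kernels; all names here are illustrative assumptions): a capacity-padded expert computation, as used in typical batched-GEMM MoE implementations, multiplies zero-padded rows, whereas an in-place per-expert computation touches each routed token exactly once.

```python
import numpy as np

# Illustrative sketch only -- not HEXA-MoE's actual operators.
# Contrast a capacity-padded expert computation (wasted work on padding
# slots) with an in-place computation (each token processed exactly once).

rng = np.random.default_rng(0)
num_tokens, d_model, num_experts = 8, 4, 2
tokens = rng.standard_normal((num_tokens, d_model))
expert_w = rng.standard_normal((num_experts, d_model, d_model))
assign = rng.integers(0, num_experts, size=num_tokens)  # top-1 routing

# Padded variant: every expert buffer is sized to a fixed capacity,
# so the GEMM also multiplies the zero-padded rows (redundant work).
capacity = num_tokens  # worst-case capacity for simplicity
padded_out = np.zeros_like(tokens)
for e in range(num_experts):
    idx = np.nonzero(assign == e)[0]
    buf = np.zeros((capacity, d_model))
    buf[: len(idx)] = tokens[idx]
    out = buf @ expert_w[e]          # padding rows are multiplied too
    padded_out[idx] = out[: len(idx)]

# In-place variant: gather only the routed tokens per expert and write
# results straight back to each token's slot -- zero padded computation.
inplace_out = np.empty_like(tokens)
for e in range(num_experts):
    idx = np.nonzero(assign == e)[0]
    inplace_out[idx] = tokens[idx] @ expert_w[e]

# Both variants produce identical outputs; only the work differs.
assert np.allclose(padded_out, inplace_out)
```

Both loops compute the same result; the padded variant performs up to `num_experts * capacity` row multiplications regardless of routing, while the in-place variant performs exactly `num_tokens`.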