Mixture-of-Experts (MoE) has emerged as a practical approach to scaling up Transformer parameters for better generalization while keeping the growth in computation overhead sub-linear. Current MoE models are mainly built with expert parallelism on distributed devices. However, this paradigm usually requires homogeneous devices for deployment and suffers from heavy communication overhead and computation redundancy. In this paper, we develop a \texttt{H}eterogeneous-aware \texttt{EX}pert \texttt{A}llocation framework, \textbf{\texttt{HEXA-MoE}}, with significantly enhanced computing efficiency. It contains two components: ($1$) \textit{Expert-Specific Operators}. We replace the typical general matrix multiplication or grouped matrix multiplication interfaces with our operators, which allow computation to be performed in-place with \textbf{ZERO} redundancy. ($2$) \textit{Adaptive Data- and Model-Centric Configurations} for different workload scales. Specifically, we introduce a pipeline-shared cache on each device to tackle the heavy memory consumption of the existing data-centric MoE library. Comprehensive experiments on the Swin-MoE benchmark consistently demonstrate the effectiveness of our \texttt{HEXA-MoE} framework, \textit{i.e.}, reducing memory consumption by $10\%\sim48\%$ and achieving a $0.5\sim4.3\times$ speedup compared to current state-of-the-art MoE libraries. Furthermore, we evaluate \texttt{HEXA-MoE} on heterogeneous devices under both data- and model-centric settings. Promising results show that employing the optimal parallel configuration with \texttt{HEXA-MoE} on heterogeneous devices can substantially reduce overall latency. Code is available at \href{https://github.com/UNITES-Lab/HEXA-MoE}{\underline{here}}.
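The redundancy that expert-specific operators eliminate can be illustrated with a minimal sketch (not the paper's actual kernels; all names here are illustrative assumptions): a capacity-padded expert computation, as used in typical batched-GEMM MoE implementations, multiplies zero-padded rows, whereas an in-place per-expert computation touches each routed token exactly once.

```python
import numpy as np

# Illustrative sketch only -- not HEXA-MoE's actual operators.
# Contrast a capacity-padded expert computation (wasted work on padding
# slots) with an in-place computation (each token processed exactly once).

rng = np.random.default_rng(0)
num_tokens, d_model, num_experts = 8, 4, 2
tokens = rng.standard_normal((num_tokens, d_model))
expert_w = rng.standard_normal((num_experts, d_model, d_model))
assign = rng.integers(0, num_experts, size=num_tokens)  # top-1 routing

# Padded variant: every expert buffer is sized to a fixed capacity,
# so the GEMM also multiplies the zero-padded rows (redundant work).
capacity = num_tokens  # worst-case capacity for simplicity
padded_out = np.zeros_like(tokens)
for e in range(num_experts):
    idx = np.nonzero(assign == e)[0]
    buf = np.zeros((capacity, d_model))
    buf[: len(idx)] = tokens[idx]
    out = buf @ expert_w[e]          # padding rows are multiplied too
    padded_out[idx] = out[: len(idx)]

# In-place variant: gather only the routed tokens per expert and write
# results straight back to each token's slot -- zero padded computation.
inplace_out = np.empty_like(tokens)
for e in range(num_experts):
    idx = np.nonzero(assign == e)[0]
    inplace_out[idx] = tokens[idx] @ expert_w[e]

# Both variants produce identical outputs; only the work differs.
assert np.allclose(padded_out, inplace_out)
```

Both loops compute the same result; the padded variant performs up to `num_experts * capacity` row multiplications regardless of routing, while the in-place variant performs exactly `num_tokens`.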