Mixture-of-Experts (MoE) has emerged as a practical approach to scaling up Transformer parameters for better generalization while keeping the growth in computation sub-linear. Current MoE models are mainly built with expert parallelism on distributed devices. However, this approach typically requires homogeneous devices for deployment and suffers from heavy communication overhead and computation redundancy. In this paper, we explore developing a \texttt{H}eterogeneous-aware \texttt{EX}pert \texttt{A}llocation framework, \textbf{\texttt{HEXA-MoE}}, with significantly enhanced computing efficiency. It contains two components: ($1$) \textit{Expert-Specific Operators}. We replace the typical general matrix multiplication (GEMM) or grouped GEMM interfaces with our operators, which allow computation to be performed in place with \textbf{ZERO} redundancy. ($2$) \textit{Adaptive Data- and Model-Centric Configurations} for different workload scales. Specifically, we introduce a pipeline-shared cache on each device to address the heavy memory consumption of existing data-centric MoE libraries. Comprehensive experiments on the Swin-MoE benchmark consistently demonstrate the effectiveness of our \texttt{HEXA-MoE} framework, i.e., reducing memory consumption by $10\%\sim48\%$ and achieving a $0.5\sim4.3\times$ speedup over current state-of-the-art MoE libraries. Furthermore, we evaluate \texttt{HEXA-MoE} on heterogeneous devices under both data- and model-centric settings. The promising results show that employing the optimal parallel configuration with \texttt{HEXA-MoE} on heterogeneous devices can substantially reduce overall latency. Code is available at https://github.com/UNITES-Lab/HEXA-MoE.
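The abstract contrasts grouped-GEMM-style expert dispatch with in-place, expert-specific computation, but does not spell out the mechanics. The NumPy toy below is a minimal sketch of that contrast only, not the paper's actual operators: it assumes top-1 routing and dense weights, and all variable names are illustrative.

```python
import numpy as np

# Toy MoE layer: T tokens, d_in -> d_out, E experts, top-1 routing (assumed).
rng = np.random.default_rng(0)
T, d_in, d_out, E = 8, 4, 4, 2
x = rng.standard_normal((T, d_in))
W = rng.standard_normal((E, d_in, d_out))   # one weight matrix per expert
route = rng.integers(0, E, size=T)          # expert id assigned to each token

# Conventional grouped-GEMM style: gather each expert's tokens into a
# contiguous buffer, multiply, then scatter results back. The gather and
# scatter create extra buffers and data movement.
y_grouped = np.empty((T, d_out))
for e in range(E):
    idx = np.nonzero(route == e)[0]
    gathered = x[idx]                        # gather (copy)
    y_grouped[idx] = gathered @ W[e]         # scatter back

# Sketch of the in-place idea: each token's output is written directly at
# its own position using its expert's weights, with no gather/scatter
# buffers. (The paper's operators fuse this on-device; this loop only
# illustrates the data layout.)
y_inplace = np.empty((T, d_out))
for t in range(T):
    y_inplace[t] = x[t] @ W[route[t]]

assert np.allclose(y_grouped, y_inplace)    # same result, different movement
```

The two paths compute identical outputs; the difference lies in the intermediate buffers and copies, which is what the in-place, expert-specific formulation eliminates.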