Mixture-of-Experts (MoE) models have gained popularity for achieving state-of-the-art performance across a wide range of tasks in computer vision and natural language processing. They expand model capacity while incurring only a minimal increase in computation cost during training. However, deploying such models for inference is difficult due to their large size and complex communication patterns. In this work, we characterize two MoE workloads, Language Modeling (LM) and Machine Translation (MT), and identify their sources of inefficiency at deployment. We propose three optimization techniques to mitigate these inefficiencies: (1) dynamic gating, (2) Expert Buffering, and (3) expert load balancing. We show that dynamic gating improves maximum throughput by 6.21-11.23$\times$ for LM, 5.75-10.98$\times$ for the MT encoder, and 2.58-5.71$\times$ for the MT decoder. It also reduces memory usage by up to 1.36$\times$ for LM and up to 1.1$\times$ for MT. We further propose Expert Buffering, a new caching mechanism that keeps only hot, active experts in GPU memory while buffering the rest in CPU memory. This reduces static memory allocation by up to 1.47$\times$. Finally, we propose a load balancing methodology that provides additional scalability to the workload.
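The Expert Buffering idea described in the abstract (hot experts resident in GPU memory, the rest held in CPU memory) can be illustrated with a toy cache. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say which eviction policy is used, so least-recently-used (LRU) eviction is assumed here, and the names `ExpertBuffer`, `gpu_slots`, and `fetch` are hypothetical. Plain Python dictionaries stand in for GPU and CPU memory, and strings stand in for expert weights.

```python
from collections import OrderedDict


class ExpertBuffer:
    """Toy model of a fixed-size 'GPU' cache in front of a 'CPU' expert store.

    Assumption: LRU eviction. The abstract does not specify the policy.
    """

    def __init__(self, cpu_experts, gpu_slots):
        if gpu_slots < 1:
            raise ValueError("gpu_slots must be at least 1")
        self.cpu_experts = dict(cpu_experts)  # every expert lives here (stand-in for CPU memory)
        self.gpu_slots = gpu_slots            # how many experts fit in the 'GPU' cache
        self.gpu_cache = OrderedDict()        # expert_id -> weights, least recently used first
        self.hits = 0
        self.misses = 0

    def fetch(self, expert_id):
        """Return the expert's weights, loading them into the cache on a miss."""
        if expert_id in self.gpu_cache:
            self.hits += 1
            self.gpu_cache.move_to_end(expert_id)  # mark as most recently used
            return self.gpu_cache[expert_id]

        # Miss: look the expert up in the CPU store (raises KeyError if unknown).
        weights = self.cpu_experts[expert_id]
        self.misses += 1
        if len(self.gpu_cache) >= self.gpu_slots:
            self.gpu_cache.popitem(last=False)     # evict the least recently used expert
        self.gpu_cache[expert_id] = weights        # stand-in for a CPU-to-GPU copy
        return weights


# Hypothetical routing trace: expert 0 is "hot", the others are touched rarely.
experts = {i: f"weights-{i}" for i in range(8)}
buffer = ExpertBuffer(experts, gpu_slots=2)
for expert_id in [0, 1, 0, 2, 0, 3, 1]:
    buffer.fetch(expert_id)

print(buffer.hits, buffer.misses, list(buffer.gpu_cache))
```

In this sketch only `gpu_slots` experts ever occupy the cache at once, which is the source of the static memory saving; the cost is a transfer on every miss, so the benefit depends on how skewed expert activation is in the actual workload.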