Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference

Large language models (LLMs) based on transformers have made significant strides in recent years, a success driven by scaling up model size. Despite their high algorithmic performance, the computational and memory requirements of LLMs present unprecedented challenges. To tackle the high compute requirements of LLMs, the Mixture-of-Experts (MoE) architecture was introduced, which scales model size without proportionally scaling computational requirements. Unfortunately, MoE's high memory demands and its dynamic activation of sparse experts restrict its applicability to real-world problems. Previous solutions that offload MoE's memory-hungry expert parameters to CPU memory fall short because migrating the activated experts from CPU to GPU adds latency that incurs a high performance overhead. Our proposed Pre-gated MoE system tackles the compute and memory challenges of conventional MoE architectures through algorithm-system co-design. Pre-gated MoE employs a novel pre-gating function that alleviates the dynamic nature of sparse expert activation, allowing the system to address the large memory footprint of MoEs while also achieving high performance. We demonstrate that Pre-gated MoE improves performance and reduces GPU memory consumption while maintaining the same level of model quality. These features allow our Pre-gated MoE system to cost-effectively deploy large-scale LLMs on a single GPU with high performance.
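The abstract does not spell out how pre-gating hides the CPU-to-GPU migration latency, so the following is only a toy latency model under a stated assumption: that the gate evaluated in one MoE block chooses the experts for the next block, so the transfer of those experts can overlap with the current block's compute. The function name `offload_latency_ms`, its parameters, and the example timings are all hypothetical and are not taken from the paper.

```python
def offload_latency_ms(num_blocks, compute_ms, transfer_ms, pre_gated):
    """Toy latency (ms) for a stack of MoE blocks whose experts live in CPU memory.

    Assumption (not from the abstract): with pre-gating, block i already knows
    which experts block i+1 needs, so their transfer overlaps block i's compute.
    """
    if not pre_gated:
        # Conventional offloading: the gate must run before the transfer can
        # start, so transfer and compute serialize in every block.
        return num_blocks * (transfer_ms + compute_ms)

    # Assumed: the first block's experts cannot be prefetched, so pay once.
    total = transfer_ms
    for block in range(num_blocks):
        is_last = block == num_blocks - 1
        # Compute of this block overlaps the prefetch for the next block;
        # the slower of the two determines the step time.
        total += compute_ms if is_last else max(compute_ms, transfer_ms)
    return total


# Hypothetical timings: 4 blocks, 2 ms compute, 3 ms expert transfer.
baseline = offload_latency_ms(4, 2.0, 3.0, pre_gated=False)
overlapped = offload_latency_ms(4, 2.0, 3.0, pre_gated=True)
print(baseline, overlapped)
```

In this model the serialized baseline pays transfer plus compute in every block, while the overlapped schedule pays roughly the larger of the two per block. Real speedups depend on measured transfer and compute times, which the abstract does not report.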

