Large language models (LLMs) based on transformers have made significant strides in recent years, a success driven by scaling up model size. Despite their high algorithmic performance, the computational and memory requirements of LLMs present unprecedented challenges. To tackle the high compute requirements of LLMs, the Mixture-of-Experts (MoE) architecture was introduced, which scales model size without proportionally scaling computational requirements. Unfortunately, MoE's high memory demands and its dynamic activation of sparse experts restrict its applicability to real-world problems. Previous solutions that offload MoE's memory-hungry expert parameters to CPU memory fall short because the latency of migrating activated experts from CPU to GPU incurs high performance overhead. Our proposed Pre-gated MoE system tackles the compute and memory challenges of conventional MoE architectures through algorithm-system co-design. Pre-gated MoE employs a novel pre-gating function that alleviates the dynamic nature of sparse expert activation, allowing the system to address the large memory footprint of MoEs while also achieving high performance. We demonstrate that Pre-gated MoE improves performance and reduces GPU memory consumption while maintaining the same level of model quality. These features allow the Pre-gated MoE system to cost-effectively deploy large-scale LLMs on a single GPU with high performance.
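The abstract does not spell out how pre-gating removes the CPU-to-GPU migration latency from the critical path. The toy sketch below illustrates one plausible reading, under the assumption that the gate evaluated in one MoE block selects the expert for the following block, so that the expert's transfer can be issued before it is needed. Every name here (`select_expert`, `run_expert`, `pregated_forward`, the block and expert counts) is hypothetical, the "fetch" is only a logged event rather than a real device copy, and top-1 routing is chosen purely for brevity.

```python
import random

NUM_BLOCKS, NUM_EXPERTS, DIM = 4, 8, 16  # illustrative sizes, not from the paper

def make_gate(rng):
    # One linear gate: a NUM_EXPERTS x DIM weight matrix.
    return [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]

def select_expert(gate, x):
    # Top-1 routing: pick the expert with the highest gate score.
    scores = [sum(w * v for w, v in zip(row, x)) for row in gate]
    return max(range(NUM_EXPERTS), key=scores.__getitem__)

def run_expert(block, expert, x):
    # Stand-in for an expert FFN; a real expert would be a learned MLP.
    scale = 1.0 + 0.01 * (block * NUM_EXPERTS + expert)
    return [scale * v for v in x]

def pregated_forward(x, first_gate, pre_gates):
    """Return the event log for one token passing through all blocks."""
    log = []
    gpu_resident = set()  # (block, expert) pairs already "copied" to the GPU

    def fetch(block, expert):
        # Hypothetical CPU-to-GPU transfer, modelled as a logged event.
        gpu_resident.add((block, expert))
        log.append(("fetch", block, expert))

    # Block 0 has no predecessor, so an ordinary gate chooses its expert.
    current = select_expert(first_gate, x)
    fetch(0, current)
    for block in range(NUM_BLOCKS):
        if block + 1 < NUM_BLOCKS:
            # Assumed pre-gate: this block decides the expert for the next
            # block, so the transfer is issued before this block's expert runs.
            nxt = select_expert(pre_gates[block], x)
            fetch(block + 1, nxt)
        assert (block, current) in gpu_resident  # expert is ready when needed
        x = run_expert(block, current, x)
        log.append(("compute", block, current))
        if block + 1 < NUM_BLOCKS:
            current = nxt
    return log

rng = random.Random(0)
first_gate = make_gate(rng)
pre_gates = [make_gate(rng) for _ in range(NUM_BLOCKS - 1)]
token = [rng.uniform(-1, 1) for _ in range(DIM)]
events = pregated_forward(token, first_gate, pre_gates)
```

In the resulting `events` log, the fetch for block `b + 1` always appears before the compute of block `b`. In a real system that ordering is what would let the transfer overlap with computation instead of stalling it, and only the experts actually selected would need to reside in GPU memory.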