Transformer-based large language models (LLMs) have made significant strides in recent years, a success driven by scaling up model size. Despite their high algorithmic performance, the computational and memory requirements of LLMs present unprecedented challenges. To tackle the high compute requirements of LLMs, the Mixture-of-Experts (MoE) architecture was introduced, which scales model size without proportionally scaling computational requirements. Unfortunately, MoE's high memory demands and dynamic activation of sparse experts restrict its applicability to real-world problems. Previous solutions that offload MoE's memory-hungry expert parameters to CPU memory fall short because migrating activated experts from CPU to GPU adds latency that incurs high performance overhead. Our proposed Pre-gated MoE system tackles the compute and memory challenges of conventional MoE architectures through algorithm-system co-design. Pre-gated MoE employs a novel pre-gating function that alleviates the dynamic nature of sparse expert activation, allowing the system to address the large memory footprint of MoEs while also achieving high performance. We demonstrate that Pre-gated MoE improves performance and reduces GPU memory consumption while maintaining the same level of model quality. These features allow Pre-gated MoE to cost-effectively deploy large-scale LLMs on a single GPU with high performance.
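To make the latency argument concrete, the following is a minimal sketch of a toy timing model, not the system's actual implementation. It assumes one particular reading of pre-gating: the experts for a block are chosen one block ahead, so their CPU-to-GPU migration can overlap with the current block's compute. The abstract does not spell out this mechanism, and the function names, the per-block cost parameters, and the example numbers are all hypothetical.

```python
def offload_latency(num_blocks: int, compute: float, transfer: float) -> float:
    """Toy model of conventional CPU offloading (assumed behavior).

    The gate of a block decides which experts that same block needs, so the
    CPU-to-GPU transfer cannot start before the block begins and sits on the
    critical path of every block.
    """
    return num_blocks * (transfer + compute)


def pregated_latency(num_blocks: int, compute: float, transfer: float) -> float:
    """Toy model of pre-gating (assumed behavior).

    The experts of block i+1 are selected while block i runs, so their
    transfer is overlapped with block i's compute. Only the first block's
    transfer is fully exposed, since no earlier block can hide it.
    """
    total = transfer  # experts of the first block must be fetched up front
    for block in range(num_blocks):
        is_last = block == num_blocks - 1
        # Prefetch for the next block runs concurrently with this block.
        prefetch = 0.0 if is_last else transfer
        total += max(compute, prefetch)
    return total


if __name__ == "__main__":
    # Hypothetical costs in arbitrary time units.
    blocks, compute, transfer = 4, 2.0, 1.5
    print("offload  :", offload_latency(blocks, compute, transfer))
    print("pre-gated:", pregated_latency(blocks, compute, transfer))
```

In this model, migration is hidden completely whenever the transfer time does not exceed the compute time of a block. When transfer dominates, the per-block cost becomes the transfer time alone rather than the sum of transfer and compute.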