SwapMoE: Efficient Memory-Constrained Serving of Large Sparse MoE Models via Dynamic Expert Pruning and Swapping

Mixture of experts (MoE) is a popular technique to improve capacity of large models with conditionally-activated parallel neural network modules (experts). Due to its remarkable scaling performance with sparse computation, it is widely used in modern Large Language Models (LLMs) and Large Vision Models (LVMs). However, serving such large models on edge devices is challenging due to memory constraints. Typical solutions like memory swapping or weight pruning may lead to significantly higher latency or severe accuracy loss. In this paper, we introduce SwapMoE, a framework for efficient continuous MoE-based large models serving with tunable memory budgets. The main idea of SwapMoE is to keep a small dynamic set of important experts, namely Virtual Experts, in the main memory for inference, while seamlessly maintaining how the Virtual Experts map to the actual experts. We use a profiling-guided planner to allocate the resources for SwapMoE that can fully utilize the memory budgets and bandwidth, and an importance-aware scheduler to efficiently identify, update, and use the Virtual Experts for accurate inference. To evaluate SwapMoE, we conduct experiments on multiple edge devices with state-of-the-art MoE-based Large Language Models and Large Vision Models. The results demonstrate remarkable performance of SwapMoE under various memory constraints. Specifically, SwapMoE can enable running large MoE models under tight memory budgets with similar latency to pruned compact models, while with significantly higher accuracy.

翻译：混合专家模型（MoE）是一种通过条件激活并行神经网络模块（专家）来提升大模型容量的流行技术。由于其稀疏计算带来的卓越扩展性能，该技术被广泛应用于现代大语言模型（LLM）和大视觉模型（LVM）中。然而，在边缘设备上部署此类大规模模型受限于内存约束。典型解决方案如内存交换或权重剪枝可能导致显著延迟增加或严重精度损失。本文提出SwapMoE框架，通过可调节内存预算实现基于MoE的大规模模型高效连续服务。SwapMoE的核心思想是将少量动态重要专家（称为虚拟专家）保留在主存中进行推理，同时无缝维护虚拟专家与真实专家的映射关系。我们采用基于性能分析的规划器为SwapMoE分配资源以充分利用内存预算和带宽，并通过重要性感知调度器高效识别、更新和使用虚拟专家进行精确推理。为评估SwapMoE，我们在多个边缘设备上使用最新MoE大语言模型和大视觉模型进行实验。结果表明SwapMoE在各种内存约束下均展现出卓越性能。具体而言，SwapMoE能够在严格内存预算下运行大型MoE模型，其延迟与剪枝后的紧凑模型相当，同时精度显著更高。

相关内容

MoDELS

关注 46

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【CVPR 2022】一个完全无监督的框架，从噪声和部分测量中学习图像，Robust Equivariant Imaging: a fully unsupervised framework for learning to image

专知会员服务

25+阅读 · 2022年3月3日

O’Reilly报告：知识图谱崛起——面向现代数据集成和数据结构体系，“The Rise of the Knowledge Graph——Toward Modern Data Integration and the Data Fabric Architecture”

专知会员服务

49+阅读 · 2022年2月18日

UCM《机器学习导论笔记》，80页pdf CSE176 Introduction to Machine Learning

专知会员服务

32+阅读 · 2021年9月29日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

35+阅读 · 2019年10月18日