Mixture of experts (MoE) is a popular technique for improving the capacity of large models with conditionally activated parallel neural network modules (experts). Owing to its remarkable scaling performance with sparse computation, it is widely used in modern Large Language Models (LLMs) and Large Vision Models (LVMs). However, serving such large models on edge devices is challenging due to memory constraints. Typical solutions such as memory swapping or weight pruning may lead to significantly higher latency or severe accuracy loss. In this paper, we introduce SwapMoE, a framework for efficient continuous serving of MoE-based large models with tunable memory budgets. The main idea of SwapMoE is to keep a small, dynamic set of important experts, namely Virtual Experts, in main memory for inference, while seamlessly maintaining the mapping from the Virtual Experts to the actual experts. We use a profiling-guided planner to allocate resources for SwapMoE so that it fully utilizes the memory budget and bandwidth, and an importance-aware scheduler to efficiently identify, update, and use the Virtual Experts for accurate inference. To evaluate SwapMoE, we conduct experiments on multiple edge devices with state-of-the-art MoE-based Large Language Models and Large Vision Models. The results demonstrate the strong performance of SwapMoE under various memory constraints. Specifically, SwapMoE enables running large MoE models under tight memory budgets with latency similar to that of pruned compact models, while achieving significantly higher accuracy.
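The Virtual Experts idea can be illustrated with a small sketch. This is a toy model written for this summary, not SwapMoE's implementation: the class `VirtualExpertTable`, its methods, the moving-average importance score, and the one-swap-per-refresh policy are all hypothetical choices. It only shows the general pattern of routing among experts that are resident in memory while gradually swapping in more important ones.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualExpertTable:
    """Toy model of one MoE layer with a fixed budget of in-memory experts.

    All names and policies here are illustrative assumptions.
    """
    num_experts: int  # total number of actual experts in the layer
    budget: int       # how many experts fit in main memory
    resident: set = field(init=False)      # actual expert ids currently in memory
    importance: dict = field(init=False)   # running importance per actual expert

    def __post_init__(self):
        # Start with the first `budget` experts resident and all scores at zero.
        self.resident = set(range(self.budget))
        self.importance = {e: 0.0 for e in range(self.num_experts)}

    def observe(self, gate_scores, decay=0.9):
        """Update importance as an exponential moving average of router scores."""
        for e, score in enumerate(gate_scores):
            self.importance[e] = decay * self.importance[e] + (1 - decay) * score

    def route(self, gate_scores):
        """Pick the best-scoring expert among those resident in memory."""
        return max(self.resident, key=lambda e: gate_scores[e])

    def refresh(self, max_swaps=1):
        """Swap in up to `max_swaps` more important experts, evicting the least important."""
        swaps = []
        for _ in range(max_swaps):
            outside = [e for e in range(self.num_experts) if e not in self.resident]
            if not outside:
                break
            best_out = max(outside, key=self.importance.get)
            worst_in = min(self.resident, key=self.importance.get)
            if self.importance[best_out] <= self.importance[worst_in]:
                break  # nothing outside memory is more important
            self.resident.remove(worst_in)
            self.resident.add(best_out)
            swaps.append((worst_in, best_out))
        return swaps
```

In this sketch, `route` never waits on a load: inference always uses an expert already in memory, and `refresh` stands in for a background update that is limited by `max_swaps`, a crude proxy for a bandwidth budget.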