Serving MoE Models on Resource-constrained Edge Devices via Dynamic Expert Swapping

Mixture of experts (MoE) is a popular technique in deep learning that improves model capacity with conditionally-activated parallel neural network modules (experts). However, serving MoE models in resource-constrained latency-critical edge scenarios is challenging due to the significantly increased model size and complexity. In this paper, we first analyze the behavior pattern of MoE models in continuous inference scenarios, which leads to three key observations about the expert activations, including temporal locality, exchangeability, and skippable computation. Based on these observations, we introduce PC-MoE, an inference framework for resource-constrained continuous MoE model serving. The core of PC-MoE is a new data structure, Parameter Committee, that intelligently maintains a subset of important experts in use to reduce resource consumption. The optimal configuration of Parameter Committee is found offline by a profiling-guided committee planner, and expert swapping and request handling at runtime are managed by an adaptive committee scheduler. To evaluate the effectiveness of PC-MoE, we conduct experiments using state-of-the-art MoE models on common computer vision and natural language processing tasks. The results demonstrate optimal trade-offs between resource consumption and model accuracy achieved by PC-MoE. For instance, on object detection tasks with the Swin-MoE model, our approach can reduce memory usage and latency by 42.34% and 18.63% with only 0.10% accuracy degradation.

翻译：混合专家模型（MoE）是深度学习中一种流行技术，通过条件激活的并行神经网络模块（专家）提升模型容量。然而，在资源受限且延迟敏感的边端场景中，MoE模型由于规模与复杂度显著增加而面临服务挑战。本文首先分析了连续推理场景下MoE模型的行为模式，由此得出关于专家激活的三个关键发现：时间局部性、可交换性与可跳过的计算。基于这些发现，我们提出了PC-MoE——面向资源受限连续MoE模型服务的推理框架。PC-MoE的核心是一种新型数据结构"参数委员会"，它能智能维护当前使用的重要专家子集以降低资源消耗。参数委员会的最优配置通过离线性能分析引导的委员会规划器获得，运行时专家交换与请求处理则由自适应委员会调度器管理。为评估PC-MoE的有效性，我们采用最先进的MoE模型在常见计算机视觉与自然语言处理任务上开展实验。结果表明PC-MoE实现了资源消耗与模型精度的最优权衡：以Swin-MoE模型执行目标检测任务为例，本方法可在精度仅损失0.10%的情况下，将内存使用量降低42.34%，延迟减少18.63%。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【CVPR 2022】一个完全无监督的框架，从噪声和部分测量中学习图像，Robust Equivariant Imaging: a fully unsupervised framework for learning to image

专知会员服务

25+阅读 · 2022年3月3日

UCM《机器学习导论笔记》，80页pdf CSE176 Introduction to Machine Learning

专知会员服务

32+阅读 · 2021年9月29日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

35+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日