Mixture of experts (MoE) is a popular technique in deep learning that improves model capacity with conditionally activated parallel neural network modules (experts). However, serving MoE models in resource-constrained, latency-critical edge scenarios is challenging due to the significantly increased model size and complexity. In this paper, we first analyze the behavior of MoE models in continuous inference scenarios, which leads to three key observations about expert activations: temporal locality, exchangeability, and skippable computation. Based on these observations, we introduce PC-MoE, an inference framework for resource-constrained continuous MoE model serving. The core of PC-MoE is a new data structure, Parameter Committee, that intelligently maintains a subset of important experts in use to reduce resource consumption. The optimal configuration of Parameter Committee is found offline by a profiling-guided committee planner, and expert swapping and request handling at runtime are managed by an adaptive committee scheduler. To evaluate the effectiveness of PC-MoE, we conduct experiments using state-of-the-art MoE models on common computer vision and natural language processing tasks. The results demonstrate optimal trade-offs between resource consumption and model accuracy achieved by PC-MoE. For instance, on object detection tasks with the Swin-MoE model, our approach can reduce memory usage and latency by 42.34% and 18.63%, respectively, with only 0.10% accuracy degradation.
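To make the idea of keeping only a subset of experts resident more concrete, the following is a minimal sketch of a fixed-capacity expert set. It is not the Parameter Committee as implemented in PC-MoE: the class name `ExpertCommittee`, the least-recently-used eviction policy, the `swap_after` miss threshold, and the fallback rule are all assumptions chosen for illustration. The sketch only mirrors the three observations loosely: recently used experts stay resident (temporal locality), a request for a non-resident expert may be served by the next-best resident one (exchangeability), and a request with no resident candidate returns `None` so the caller can skip the expert computation (skippable computation).

```python
from collections import OrderedDict


class ExpertCommittee:
    """Toy fixed-capacity set of resident experts (illustrative only)."""

    def __init__(self, capacity, swap_after=2):
        self.capacity = capacity        # max number of experts kept in memory
        self.swap_after = swap_after    # misses tolerated before swapping an expert in
        self.resident = OrderedDict()   # expert id -> None, least recently used first
        self.misses = {}                # expert id -> misses since it was last loaded
        self.swaps = 0                  # number of expert loads performed

    def _load(self, expert):
        # Stand-in for copying the expert's weights into device memory.
        self.resident[expert] = None
        self.misses.pop(expert, None)
        self.swaps += 1

    def route(self, ranked_experts):
        """Pick an expert to run for one request.

        ranked_experts lists expert ids by gate score, best first.
        Returns the chosen expert id, or None if the computation is skipped.
        """
        top = ranked_experts[0]
        if top in self.resident:
            self.resident.move_to_end(top)  # mark as most recently used
            return top
        if len(self.resident) < self.capacity:
            self._load(top)
            return top
        self.misses[top] = self.misses.get(top, 0) + 1
        if self.misses[top] >= self.swap_after:
            self.resident.popitem(last=False)  # evict the least recently used expert
            self._load(top)
            return top
        # Fall back to the best-ranked expert that is already resident.
        for expert in ranked_experts[1:]:
            if expert in self.resident:
                self.resident.move_to_end(expert)
                return expert
        return None  # no resident candidate: caller skips the expert layer
```

In this sketch a larger `swap_after` trades accuracy for fewer weight transfers, since more requests are served by substitute experts before a swap is triggered. How PC-MoE actually chooses its committee size and swap policy is determined by its offline planner and runtime scheduler, which this example does not attempt to reproduce.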