Mixture-of-Experts (MoE) models offer high capacity at low inference cost by activating only a small subset of experts per input. However, deploying an MoE model requires all experts to reside in memory, creating a gap between the resources actually used by the activated experts and the resources provisioned. This underutilization is even more pronounced in multi-tenant scenarios. In this paper, we propose FaaSMoE, a multi-tenant MoE serving architecture built on Function-as-a-Service (FaaS) platforms. FaaSMoE decouples the control and execution planes of MoE by deploying experts as stateless FaaS functions, enabling on-demand, scale-to-zero expert invocation across tenants. FaaSMoE further supports configurable expert granularity within functions, trading per-expert elasticity for reduced invocation overhead. We implement a prototype on an open-source, edge-oriented FaaS platform and evaluate it using Qwen1.5-moe-2.7B under multi-tenant workloads. Compared to a full-model baseline, FaaSMoE uses less than one third of the resources, demonstrating a practical and resource-efficient path towards scalable MoE serving in multi-tenant environments.
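To make the granularity trade-off concrete, the following is a minimal sketch of how a control plane might map routed experts onto FaaS functions. It is not the FaaSMoE implementation: the names `top_k_route`, `function_id`, and `plan_invocations`, the contiguous expert-to-function packing, and the assumption that gate scores are positive probabilities are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# (token index, expert id, normalized gate weight)
Routed = Tuple[int, int, float]


def top_k_route(scores: Sequence[Sequence[float]], k: int) -> List[Routed]:
    """Pick the k highest-scoring experts per token (control-plane step).

    Assumes each row holds positive gate probabilities, one per expert.
    """
    routed: List[Routed] = []
    for token_idx, row in enumerate(scores):
        top = sorted(range(len(row)), key=lambda e: row[e], reverse=True)[:k]
        total = sum(row[e] for e in top)
        # Renormalize the selected gate weights so they sum to 1 per token.
        routed.extend((token_idx, e, row[e] / total) for e in top)
    return routed


def function_id(expert_id: int, experts_per_function: int) -> int:
    """Map an expert to the hypothetical function hosting it.

    Assumption: experts are packed contiguously, experts_per_function each.
    """
    return expert_id // experts_per_function


def plan_invocations(
    routed: Sequence[Routed], experts_per_function: int
) -> Dict[int, List[Routed]]:
    """Group routed work by hosting function: one invocation per dict key."""
    plan: Dict[int, List[Routed]] = defaultdict(list)
    for item in routed:
        plan[function_id(item[1], experts_per_function)].append(item)
    return dict(plan)


if __name__ == "__main__":
    # Two tokens, four experts, top-2 routing.
    gate_scores = [[0.1, 0.5, 0.3, 0.1], [0.4, 0.1, 0.1, 0.4]]
    routed = top_k_route(gate_scores, k=2)
    for granularity in (1, 2, 4):
        plan = plan_invocations(routed, granularity)
        print(f"experts/function={granularity}: {len(plan)} invocation(s)")
```

In this toy setup, packing more experts into each function reduces the number of invocations per batch, but each function then holds experts that may not be needed, which is the elasticity cost the abstract refers to.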