MoEless: Efficient MoE LLM Serving via Serverless Computing

Large Language Models (LLMs) have become a cornerstone of AI, driving progress across diverse domains such as content creation, search and recommendation systems, and AI-assisted workflows. To alleviate extreme training costs while scaling up model size, Mixture-of-Experts (MoE) has become a popular backbone for modern LLMs, which are commonly served in distributed deployments using expert parallelism (EP). However, MoE's sparse activation mechanism leads to severe expert load imbalance, where a few experts become overloaded while others remain idle, resulting in expert stragglers that inflate inference latency and serving cost. Existing expert load-balancing solutions assume static resource configurations on serverful infrastructures, which limits expert scalability and elasticity and results in either costly real-time expert swapping or degraded generation quality. We present MoEless, the first serverless MoE serving framework that mitigates expert load imbalance and accelerates inference via serverless experts. MoEless employs lightweight, layer-aware predictors to accurately estimate incoming expert load distributions and proactively identify stragglers. We design optimized expert scaling and placement strategies to maximize function locality, improve GPU utilization, and balance loads across experts and GPUs. MoEless is prototyped on top of Megatron-LM and deployed on an eight-GPU testbed. Experiments with open-source MoE models and real-world workloads show that MoEless reduces inference latency by 43% and inference cost by 84% compared to state-of-the-art solutions.
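To make the straggler problem concrete, the following is a minimal Python sketch of expert load imbalance and of how replicating hot experts and spreading the replicas across GPUs can flatten it. It is an illustration under stated assumptions, not the MoEless algorithm: the function names (`expert_loads`, `imbalance`, `scale_replicas`, `place_on_gpus`), the greedy heuristics, and the assumption that an expert's tokens split evenly across its replicas are all hypothetical. It also takes the per-expert load as given, whereas MoEless predicts it per layer.

```python
import heapq
from collections import Counter

def expert_loads(assignments, num_experts):
    """Count how many tokens the router sent to each expert."""
    counts = Counter(assignments)
    return [counts.get(e, 0) for e in range(num_experts)]

def imbalance(loads):
    """Max load divided by mean load; 1.0 means perfectly balanced."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean > 0 else 1.0

def scale_replicas(loads, extra_replicas):
    """Greedy scaling (assumed heuristic): hand each extra replica to the
    expert whose per-replica load is currently the highest."""
    replicas = [1] * len(loads)
    heap = [(-load, e) for e, load in enumerate(loads)]  # max-heap via negation
    heapq.heapify(heap)
    for _ in range(extra_replicas):
        _, e = heapq.heappop(heap)
        replicas[e] += 1
        heapq.heappush(heap, (-loads[e] / replicas[e], e))
    return replicas

def place_on_gpus(loads, replicas, num_gpus):
    """Greedy placement (assumed heuristic): heaviest replica first,
    always onto the GPU with the least load so far."""
    items = sorted(
        ((loads[e] / replicas[e], e)
         for e in range(len(loads)) for _ in range(replicas[e])),
        reverse=True,
    )
    gpu_load = [0.0] * num_gpus
    placement = [[] for _ in range(num_gpus)]
    for load, e in items:
        g = min(range(num_gpus), key=gpu_load.__getitem__)
        gpu_load[g] += load
        placement[g].append(e)
    return placement, gpu_load

if __name__ == "__main__":
    loads = [80, 10, 5, 5]  # hypothetical token counts for four experts
    _, static = place_on_gpus(loads, [1, 1, 1, 1], num_gpus=2)
    replicas = scale_replicas(loads, extra_replicas=2)
    _, scaled = place_on_gpus(loads, replicas, num_gpus=2)
    print("replicas per expert:", replicas)
    print("GPU imbalance, static:", round(imbalance(static), 3))
    print("GPU imbalance, scaled:", round(imbalance(scaled), 3))
```

The sketch ignores function locality, and it may co-locate two replicas of the same expert on one GPU, which a real placement strategy would need to account for.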
