Large Language Models (LLMs) have achieved immense success in revolutionizing various applications, including content generation, search and recommendation, and AI-assisted operation. To reduce high training costs, the Mixture-of-Experts (MoE) architecture has become a popular backbone for modern LLMs. However, despite these benefits, serving MoE-based LLMs suffers from severe memory inefficiency due to sparsely activated experts. Recent studies propose offloading inactive experts from GPU memory to CPU memory to improve the serving efficiency of MoE models, but these coarse-grained designs incur either high inference latency or high model memory footprints. To tame the latency-memory trade-off in MoE serving, we present fMoE, a fine-grained expert offloading system for MoE serving that achieves low inference latency with memory efficiency. fMoE extracts fine-grained expert selection patterns from MoE models and semantic hints from input prompts to efficiently guide expert prefetching, caching, and offloading decisions. We prototype fMoE on top of HuggingFace Transformers and deploy it on a six-GPU testbed. Experiments with open-source MoE models and real-world workloads show that fMoE reduces inference latency by 47% and improves expert hit rate by 36% over state-of-the-art solutions.
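The abstract's core mechanism, keeping a small working set of experts resident in GPU memory while prefetching those predicted for upcoming tokens and evicting the rest to CPU memory, can be sketched as a simple cache. This is a minimal illustrative sketch with assumed names (`ExpertCache`, `fetch`, `prefetch`) and an LRU eviction policy chosen for brevity; it does not reflect fMoE's actual data structures or its semantic-hint-guided policy.

```python
from collections import OrderedDict


class ExpertCache:
    """Toy LRU cache for MoE expert weights (illustrative only; the class
    name, API, and eviction policy are assumptions, not fMoE's design)."""

    def __init__(self, capacity):
        self.capacity = capacity    # max experts resident in "GPU memory"
        self.cache = OrderedDict()  # expert_id -> weights (placeholder strings here)
        self.hits = 0
        self.lookups = 0

    def fetch(self, expert_id):
        """Return expert weights, loading from "CPU memory" on a miss."""
        self.lookups += 1
        if expert_id in self.cache:
            self.hits += 1
            self.cache.move_to_end(expert_id)   # mark as most recently used
        else:
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)  # evict least-recently-used expert
            self.cache[expert_id] = f"weights_{expert_id}"  # stand-in for a real copy
        return self.cache[expert_id]

    def prefetch(self, predicted_ids):
        """Warm the cache with experts predicted for upcoming layers/tokens."""
        for eid in predicted_ids:
            self.fetch(eid)

    def hit_rate(self):
        return self.hits / max(self.lookups, 1)
```

A serving loop would call `prefetch` with experts predicted from routing patterns before each layer executes, so that `fetch` at inference time mostly hits GPU-resident weights; fMoE's contribution is making those predictions fine-grained (per expert map and per prompt) rather than coarse per-layer heuristics.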