Large Language Models (LLMs) have achieved immense success in revolutionizing various applications, including content generation, search and recommendation, and AI-assisted operation. To reduce high training costs, the Mixture-of-Experts (MoE) architecture has become a popular backbone for modern LLMs. However, despite these benefits, serving MoE-based LLMs suffers from severe memory inefficiency due to sparsely activated experts. Recent studies propose offloading inactive experts from GPU memory to CPU memory to improve the serving efficiency of MoE models, but these coarse-grained designs incur either high inference latency or high model memory footprints. To tame the latency-memory trade-off in MoE serving, we present fMoE, a fine-grained expert offloading system for MoE serving that achieves low inference latency with memory efficiency. fMoE extracts fine-grained expert selection patterns from MoE models and semantic hints from input prompts to efficiently guide expert prefetching, caching, and offloading decisions. We prototype fMoE on top of HuggingFace Transformers and deploy it on a six-GPU testbed. Experiments with open-source MoE models and real-world workloads show that fMoE reduces inference latency by 47% and improves expert hit rate by 36% over state-of-the-art solutions.
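The abstract's core mechanism, keeping a small working set of experts resident in GPU memory while prefetching those predicted for upcoming tokens and evicting the rest to CPU memory, can be sketched as a simple cache. This is a minimal illustrative sketch with assumed names (`ExpertCache`, `fetch`, `prefetch`) and an LRU eviction policy chosen for brevity; it does not reflect fMoE's actual data structures or its semantic-hint-guided policy.

```python
from collections import OrderedDict


class ExpertCache:
    """Toy LRU cache for MoE expert weights (illustrative only; the class
    name, API, and eviction policy are assumptions, not fMoE's design)."""

    def __init__(self, capacity):
        self.capacity = capacity    # max experts resident in "GPU memory"
        self.cache = OrderedDict()  # expert_id -> weights (placeholder strings here)
        self.hits = 0
        self.lookups = 0

    def fetch(self, expert_id):
        """Return expert weights, loading from "CPU memory" on a miss."""
        self.lookups += 1
        if expert_id in self.cache:
            self.hits += 1
            self.cache.move_to_end(expert_id)   # mark as most recently used
        else:
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)  # evict least-recently-used expert
            self.cache[expert_id] = f"weights_{expert_id}"  # stand-in for a real copy
        return self.cache[expert_id]

    def prefetch(self, predicted_ids):
        """Warm the cache with experts predicted for upcoming layers/tokens."""
        for eid in predicted_ids:
            self.fetch(eid)

    def hit_rate(self):
        return self.hits / max(self.lookups, 1)
```

A serving loop would call `prefetch` with experts predicted from routing patterns before each layer executes, so that `fetch` at inference time mostly hits GPU-resident weights; fMoE's contribution is making those predictions fine-grained (per expert map and per prompt) rather than coarse per-layer heuristics.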