This paper presents MoE-Infinity, an offloading-efficient serving system for sparse mixture-of-experts (MoE) models. To optimize offloading, MoE-Infinity introduces novel request-level tracing of expert activation, capturing MoE's sparse execution patterns such as selective activation, group activation, and skewed reuse. Leveraging the request-level trace, MoE-Infinity performs effective expert prefetching and expert caching, achieving high efficiency in transferring model parameters from host memory to GPU memory. Experimental results demonstrate that MoE-Infinity achieves low latency comparable to expensive full-GPU deployments, which require up to 4X more GPU resources than MoE-Infinity. Compared to LLM serving systems that support offloading, such as DeepSpeed-Inference, Llama.cpp, Mixtral Offloading, and BrainStorm, MoE-Infinity exhibits superior latency performance, providing 2-20X improvements when serving various MoE models across a large collection of LLM tasks. MoE-Infinity's source code is publicly available at https://github.com/TorchMoE/MoE-Infinity