This paper presents MoE-Infinity, an offloading-efficient serving system for sparse mixture-of-experts (MoE) models. To optimize offloading, MoE-Infinity introduces novel request-level tracing of expert activation, capturing MoE's sparse execution patterns such as selective activation, group activation, and skewed reuse. Leveraging the request-level trace, MoE-Infinity performs effective expert prefetching and expert caching, achieving high efficiency in transferring model parameters from host memory to GPU memory. Experimental results demonstrate that MoE-Infinity achieves low latency comparable to expensive full-GPU deployments, which require up to 4X more GPU resources than MoE-Infinity. Compared to LLM serving systems that support offloading, such as DeepSpeed-Inference, Llama.cpp, Mixtral Offloading, and BrainStorm, MoE-Infinity exhibits superior latency performance, providing 2-20X improvements when serving various MoE models across a large collection of LLM tasks. MoE-Infinity's source code is publicly available at https://github.com/TorchMoE/MoE-Infinity