This paper presents MoE-Infinity, an efficient MoE inference system designed for personal machines with limited GPU memory capacity. The key insight behind MoE-Infinity is that personal machines are typically single-user environments, so MoE-based LLMs operate with a batch size of one. In this setting, MoE models exhibit a high degree of activation sparsity: only a small number of experts are frequently reused when generating tokens during the decode phase. Leveraging this insight, we design a sparsity-aware expert cache, which traces the sparse activation of experts during inference and carefully selects the traces that represent the sparsity pattern. By analyzing these selected traces, MoE-Infinity guides the replacement and prefetching of the expert cache, achieving 3.1-16.7x per-token latency improvements over state-of-the-art systems, including vLLM, Ollama, DeepSpeed, and BrainStorm, across various MoE models (DeepSeek and Mixtral) on a range of LLM tasks. MoE-Infinity's source code is publicly available at https://github.com/EfficientMoE/MoE-Infinity.
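To make the cache idea concrete, the following is a minimal illustrative sketch (not MoE-Infinity's actual implementation): a cache that records each decoded token's activated experts as a trace, derives reuse frequencies from that trace, and uses them to pick eviction victims and prefetch candidates. All class and method names here are hypothetical.

```python
# Minimal sketch of a sparsity-aware expert cache (illustrative only):
# record which experts fire per decoded token, then use the recent trace
# to guide eviction (replacement) and prefetching of non-resident experts.
from collections import Counter, deque

class SparsityAwareExpertCache:
    def __init__(self, capacity, trace_window=64):
        self.capacity = capacity                 # experts that fit in GPU memory
        self.cached = set()                      # expert ids currently resident on GPU
        self.trace = deque(maxlen=trace_window)  # recent per-token activation sets
        self.freq = Counter()                    # reuse counts derived from the trace

    def record_activation(self, expert_ids):
        """Append one token's activated experts to the trace and refresh counts."""
        self.trace.append(set(expert_ids))
        self.freq = Counter(e for step in self.trace for e in step)

    def ensure_resident(self, expert_id):
        """Load an expert, evicting the least-reused cached expert if the cache is full."""
        if expert_id in self.cached:
            return
        if len(self.cached) >= self.capacity:
            victim = min(self.cached, key=lambda e: self.freq[e])
            self.cached.remove(victim)           # evict: drop expert weights from GPU
        self.cached.add(expert_id)               # fetch: copy expert weights to GPU

    def prefetch_candidates(self, k=2):
        """Experts most frequently reused in the trace but not yet resident."""
        ranked = [e for e, _ in self.freq.most_common() if e not in self.cached]
        return ranked[:k]

# Usage: during decode, record each token's routing decision and query
# the hottest non-resident experts to prefetch before the next step.
cache = SparsityAwareExpertCache(capacity=3)
for routed in ([0, 5], [0, 7], [0, 5], [3, 5]):   # hypothetical router outputs
    for e in routed:
        cache.ensure_resident(e)
    cache.record_activation(routed)
print(cache.prefetch_candidates())               # e.g. [7]: reused earlier but evicted
```

In this sketch, frequency over a sliding trace window stands in for the trace analysis described above; the point is only that per-token activation traces, not generic LRU recency, drive both replacement and prefetching decisions.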