This paper presents MoE-Infinity, a cost-efficient mixture-of-experts (MoE) serving system that realizes activation-aware expert offloading. MoE-Infinity features sequence-level expert activation tracing, a new approach adept at identifying sparse activations and capturing the temporal locality of MoE inference. By analyzing these traces, MoE-Infinity performs novel activation-aware expert prefetching and caching, substantially reducing the latency overhead usually associated with offloading experts and thereby improving cost-performance. Extensive experiments in a cluster show that MoE-Infinity outperforms numerous existing systems and approaches, reducing latency by 4-20× and decreasing deployment costs by over 8× for various MoEs. MoE-Infinity's source code is publicly available at https://github.com/TorchMoE/MoE-Infinity
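To make the idea of activation-aware caching and prefetching concrete, the following is a minimal sketch, not MoE-Infinity's actual implementation or API. It assumes a per-sequence trace that simply counts how often each (layer, expert) pair is activated, and a fixed-capacity cache that evicts the resident expert with the fewest activations in that trace. The names `SequenceTrace`, `ActivationAwareCache`, `hottest`, and `access` are hypothetical, and the count-based eviction rule is a simplified stand-in for whatever policy the system actually uses.

```python
from collections import defaultdict


class SequenceTrace:
    """Hypothetical per-sequence trace: activation counts per (layer, expert)."""

    def __init__(self):
        self.counts = defaultdict(int)

    def record(self, layer, expert):
        self.counts[(layer, expert)] += 1

    def hottest(self, layer, k):
        """Return up to k expert ids in `layer`, most activated first.

        A prefetcher could use this list as candidates to load ahead of time.
        Ties are broken by the smaller expert id.
        """
        ranked = [(c, e) for (l, e), c in self.counts.items() if l == layer]
        ranked.sort(key=lambda ce: (-ce[0], ce[1]))
        return [e for _, e in ranked[:k]]


class ActivationAwareCache:
    """Fixed-capacity expert cache that evicts the least-activated expert.

    Assumption: "least activated in the current sequence trace" is used as
    the eviction criterion; this is an illustrative policy only.
    """

    def __init__(self, capacity, trace):
        self.capacity = capacity
        self.trace = trace
        self.resident = set()  # (layer, expert) pairs currently "on device"
        self.misses = 0

    def access(self, layer, expert):
        """Record the activation; return True on a cache hit, False on a miss."""
        key = (layer, expert)
        self.trace.record(layer, expert)
        if key in self.resident:
            return True
        self.misses += 1
        if len(self.resident) >= self.capacity:
            # Evict the resident expert with the lowest activation count.
            victim = min(self.resident, key=lambda k: (self.trace.counts[k], k))
            self.resident.remove(victim)
        self.resident.add(key)
        return False
```

In this sketch, a frequently activated expert stays resident while rarely used ones are swapped out, which is the intuition behind exploiting sparse activation and temporal locality; a real system would additionally overlap prefetch transfers with computation.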