While Mixture-of-Experts (MoE) architectures substantially bolster the expressive power of large language models, their prohibitive memory footprint severely impedes practical deployment on resource-constrained edge devices, especially when model behavior must be preserved without resorting to lossy quantization. In this paper, we present ZipMoE, an efficient and semantically lossless on-device MoE serving system. ZipMoE exploits the synergy between the hardware properties of edge devices and the statistical redundancy inherent in MoE parameters via a caching-scheduling co-design with a provable performance guarantee. Fundamentally, our design shifts the paradigm of on-device MoE inference from an I/O-bound bottleneck to a compute-centric workflow that enables efficient parallelization. We implement a prototype of ZipMoE and conduct extensive experiments on representative edge computing platforms using popular open-source MoE models and real-world workloads. Our evaluation shows that ZipMoE achieves up to $72.77\%$ lower inference latency and up to $6.76\times$ higher throughput than state-of-the-art systems.