While Mixture-of-Experts (MoE) architectures substantially bolster the expressive power of large language models, their prohibitive memory footprint severely impedes practical deployment on resource-constrained edge devices, especially when model behavior must be preserved without relying on lossy quantization. In this paper, we present ZipMoE, an efficient and semantically lossless on-device MoE serving system. ZipMoE exploits the synergy between the hardware properties of edge devices and the statistical redundancy inherent to MoE parameters via a caching-scheduling co-design with a provable performance guarantee. Fundamentally, our design shifts on-device MoE inference from an I/O-bound bottleneck to a compute-centric workflow that enables efficient parallelization. We implement a prototype of ZipMoE and conduct extensive experiments on representative edge computing platforms using popular open-source MoE models and real-world workloads. Our evaluation shows that ZipMoE achieves up to $72.77\%$ lower inference latency and up to $6.76\times$ higher throughput than state-of-the-art systems. Our code is available at: https://github.com/npnothard/ZipMoE-ICML26.