Mixture-of-Experts (MoE) models offer efficient scaling through conditional computation, but their large parameter counts and expensive expert offloading make on-device deployment challenging. Existing acceleration techniques such as prefetching or expert clustering often increase energy usage or reduce expert diversity. We present SliceMoE, an energy-efficient MoE inference framework for miss-rate-constrained deployment. SliceMoE introduces Dynamic Bit-Sliced Caching (DBSC), which caches experts at slice-level granularity and assigns precision on demand to expand effective expert capacity. To support mixed-precision experts without memory duplication, we propose Calibration-Free Asymmetric Matryoshka Quantization (AMAT), a truncation-based scheme that maintains compatibility between low-bit and high-bit slices. We further introduce Predictive Cache Warmup (PCW) to reduce early-decode cold misses by reshaping cache contents during prefill. Evaluated on DeepSeek-V2-Lite and Qwen1.5-MoE-A2.7B, SliceMoE reduces decode-stage energy consumption by up to 2.37x and 2.85x, respectively, and improves decode latency by up to 1.81x and 1.64x, respectively, while preserving near-high-bit accuracy. These results demonstrate that slice-level caching enables efficient on-device MoE deployment.
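To make the idea of truncation-compatible slices concrete, the following is a minimal sketch and not the AMAT algorithm itself. It assumes a plain min/max asymmetric quantizer, an 8-bit high-precision code, and a 4-bit most-significant slice; the function names (`quantize_asymmetric`, `split_slices`, `dequantize`) and the bit widths are hypothetical choices made for illustration. The point it shows is that the low-bit representation is simply the upper slice of the high-bit code, so a cache could hold the upper slice alone or both slices without storing two copies of an expert.

```python
def quantize_asymmetric(weights, bits=8):
    """Min/max asymmetric quantization to unsigned integer codes (assumed scheme)."""
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Clamp so floating-point error can never push a code out of range.
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def split_slices(codes, high_bits=8, low_bits=4):
    """Split each code into a most-significant slice and a residual slice."""
    shift = high_bits - low_bits
    msb = [c >> shift for c in codes]               # low-precision view (truncation)
    lsb = [c & ((1 << shift) - 1) for c in codes]   # extra bits for high precision
    return msb, lsb


def dequantize(msb, scale, zero, lsb=None, high_bits=8, low_bits=4):
    """Reconstruct weights from the upper slice alone, or from both slices."""
    shift = high_bits - low_bits
    if lsb is None:
        # Only the upper slice is available: add half a truncation step
        # so the truncation error is centred around zero.
        codes = [(m << shift) + (1 << (shift - 1)) for m in msb]
    else:
        # Both slices are available: recover the exact high-bit code.
        codes = [(m << shift) | r for m, r in zip(msb, lsb)]
    return [c * scale + zero for c in codes]


weights = [-1.0, -0.3, 0.0, 0.42, 1.0]
codes, scale, zero = quantize_asymmetric(weights)
msb, lsb = split_slices(codes)
low_precision = dequantize(msb, scale, zero)        # upper slice only
high_precision = dequantize(msb, scale, zero, lsb)  # upper + residual slice
```

In this sketch, upgrading an expert from low to high precision only requires fetching the residual slice, which is the property a slice-level cache would rely on; how SliceMoE actually chooses bit widths and handles scales is not specified in the abstract.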