Mixture of Experts (MoE) LLMs have recently gained attention for their ability to enhance performance by selectively engaging specialized subnetworks, or "experts," for each input. However, deploying MoEs on memory-constrained devices remains challenging, particularly when generating tokens sequentially with a batch size of one, as opposed to typical high-throughput settings involving long sequences or large batches. In this work, we optimize MoE inference on memory-constrained devices where only a subset of expert weights fits in DRAM. We introduce a novel cache-aware routing strategy that leverages expert reuse during token generation to improve cache locality. We evaluate our approach on language modeling, MMLU, and GSM8K benchmarks and present on-device results demonstrating 2$\times$ speedups on mobile devices, offering a flexible, training-free solution to extend MoE's applicability across real-world applications.
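The cache-aware routing idea described above can be sketched minimally as biasing the router's expert scores toward experts already resident in DRAM, so that when scores are close, the cheaper cached expert wins. The function names, the additive-bonus formulation, and the `bonus` knob below are illustrative assumptions, not the paper's exact method:

```python
def cache_aware_route(router_logits, cached_experts, top_k=2, bonus=1.0):
    """Select top-k experts for one token, nudging scores of experts
    already loaded in DRAM to avoid a slow flash-to-DRAM weight load.

    router_logits : per-expert raw router scores for the current token
    cached_experts: indices of experts currently resident in DRAM
    bonus         : additive score boost for cached experts
                    (hypothetical knob trading quality for locality)
    """
    scores = list(router_logits)
    for e in cached_experts:
        scores[e] += bonus  # prefer experts that need no weight load
    # rank experts by adjusted score and keep the top-k
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])
```

For example, with raw scores `[0.9, 1.0, 0.95, 0.1]`, plain top-2 routing picks experts 1 and 2; if expert 0 is already cached and receives a 0.5 bonus, the adjusted scores favor reusing it, and the selection becomes experts 0 and 1, saving one expert load at little cost in router score.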