The Mixture-of-Experts (MoE) architecture has demonstrated significant advantages in the era of Large Language Models (LLMs), offering enhanced capabilities with reduced inference costs. However, deploying MoE-based LLMs on memory-constrained edge devices remains challenging due to their substantial memory requirements. While existing expert-offloading methods alleviate the memory requirements, they often incur significant expert-loading costs or compromise model accuracy. We present HOBBIT, a mixed-precision expert-offloading system that enables flexible and efficient MoE inference. Our key insight is that dynamically replacing less critical cache-miss experts with low-precision versions can substantially reduce expert-loading latency while preserving model accuracy. HOBBIT introduces three innovative techniques that map onto the natural hierarchy of MoE computation: (1) a token-level dynamic expert loading mechanism, (2) a layer-level adaptive expert prefetching technique, and (3) a sequence-level multidimensional expert caching policy. These innovations fully leverage the benefits of mixed-precision expert inference. By implementing HOBBIT on top of the well-known LLM inference framework Llama.cpp, we evaluate its performance across different edge devices with representative MoE models. The results demonstrate that HOBBIT achieves up to a 9.93x speedup in decoding compared to state-of-the-art MoE offloading systems.
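The key insight above can be sketched in a few lines. This is a minimal, hypothetical illustration (not HOBBIT's actual implementation): on a cache miss, an expert whose router gate score falls below a threshold is served in a low-precision format to cut loading latency, while critical experts are still loaded in full precision. All names, the precision labels, and the fixed-threshold criticality heuristic are assumptions for illustration only.

```python
# Illustrative sketch of mixed-precision expert selection on cache miss.
# The criticality proxy (gate score vs. a fixed threshold) and the
# precision tags ("fp16"/"int4") are hypothetical, not HOBBIT's design.

def select_expert_version(expert_id, gate_score, cache, threshold=0.3):
    """Decide which version of an expert to use for the current token.

    cache maps expert_id -> (precision, expert_id) for resident experts.
    Returns a (precision, expert_id) tuple describing what to load/use.
    """
    if expert_id in cache:
        # Cache hit: use the resident copy, no loading cost.
        return cache[expert_id]
    if gate_score < threshold:
        # Cache miss on a less critical expert: load the cheaper
        # low-precision version to reduce expert-loading latency.
        return ("int4", expert_id)
    # Cache miss on a critical expert: pay for full precision.
    return ("fp16", expert_id)

cache = {7: ("fp16", 7)}
print(select_expert_version(7, 0.9, cache))  # hit: resident fp16 copy
print(select_expert_version(3, 0.1, cache))  # miss, low score: int4
print(select_expert_version(5, 0.8, cache))  # miss, high score: fp16
```

In a real system the threshold would be dynamic and per-token, and the low-precision weights would be dequantized (or computed in quantized form) during the expert's forward pass; this sketch only shows the selection logic.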