Mixture-of-Experts (MoE), while offering significant advantages as a Large Language Model (LLM) architecture, faces substantial challenges when deployed on low-cost edge devices with tight memory constraints. Expert offloading mitigates this issue by storing expert parameters in CPU memory and caching a subset of popular experts in GPU memory. Although this approach improves GPU memory efficiency by caching only the experts likely to be used, the GPU memory reserved for the expert cache remains underutilized compared with dense LLMs. This paper presents OD-MoE, a distributed MoE inference framework that obviates the need for expert caches via fully on-demand expert loading. OD-MoE is built upon two key mechanisms: 1) parallelizing expert loading and expert computation across distributed edge nodes, and 2) a highly accurate emulative predictor that forecasts expert activations multiple layers ahead while expert computation is ongoing. With these mechanisms, OD-MoE dynamically loads each target expert onto one of the distributed nodes just in time before its activation and promptly evicts it afterward, freeing GPU memory for subsequent experts. We comprehensively benchmark OD-MoE against state-of-the-art MoE offloading systems on a ten-node testbed. Experimental results show that: 1) OD-MoE achieves 99.94% expert activation prediction accuracy, substantially surpassing all existing methods; and 2) OD-MoE delivers approximately 75% of the decoding speed of a fully GPU-cached MoE deployment while using only one third of the GPU memory. More importantly, by eliminating the need for expert caches, OD-MoE enables MoE inference on edge nodes with less than 1 GB of GPU memory, paving the way for practical MoE deployment on low-cost IoT devices at the edge in the LLM era.
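To make the on-demand pipeline concrete, the sketch below illustrates the general idea of overlapping expert loading with expert computation: experts predicted a few layers ahead are fetched in the background while the current layer computes, and each expert is evicted immediately after use. This is a minimal conceptual sketch, not the paper's implementation; all helper names (predict_experts, load_expert, compute_layer, evict) and the lookahead depth are illustrative assumptions.

```python
# Conceptual sketch of on-demand expert loading with lookahead prediction.
# Placeholder helpers stand in for the predictor, the CPU->GPU (or peer-node)
# expert transfer, the expert computation, and the post-use eviction.
from concurrent.futures import ThreadPoolExecutor
import time

NUM_LAYERS = 8
LOOKAHEAD = 2  # assumed number of layers the predictor runs ahead of computation

def predict_experts(layer):          # stand-in for the activation predictor
    return [f"L{layer}-E{e}" for e in (0, 1)]

def load_expert(expert_id):          # stand-in for loading one expert's weights
    time.sleep(0.01)
    return expert_id

def compute_layer(layer, experts):   # stand-in for the MoE layer's expert computation
    time.sleep(0.02)
    return f"layer {layer} used {experts}"

def evict(experts):                  # stand-in for freeing GPU memory right after use
    pass

def decode_one_token():
    with ThreadPoolExecutor(max_workers=2) as pool:
        pending = {}  # layer index -> Future resolving to that layer's loaded experts
        # Warm up: start loading the first LOOKAHEAD layers' predicted experts.
        for layer in range(min(LOOKAHEAD, NUM_LAYERS)):
            pending[layer] = pool.submit(
                lambda l=layer: [load_expert(e) for e in predict_experts(l)])
        for layer in range(NUM_LAYERS):
            experts = pending.pop(layer).result()  # blocks only if the load lags behind
            nxt = layer + LOOKAHEAD
            if nxt < NUM_LAYERS:
                # Overlap the load for a future layer with the current layer's compute.
                pending[nxt] = pool.submit(
                    lambda l=nxt: [load_expert(e) for e in predict_experts(l)])
            print(compute_layer(layer, experts))
            evict(experts)  # GPU memory is reused by subsequent layers

if __name__ == "__main__":
    decode_one_token()
```

With sufficiently accurate lookahead prediction, the blocking call that waits for a layer's experts rarely stalls, which is the property the decoding-speed results above rely on.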