EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models

Large Language Models (LLMs) such as GPTs and LLaMA have ushered in a revolution in machine intelligence, owing to their exceptional capabilities across a wide range of machine learning tasks. Moving LLMs from data centers to edge devices, however, presents both challenges and opportunities: the shift can enhance privacy and availability, but the enormous parameter sizes of these models lead to impractical runtime costs. We therefore introduce EdgeMoE, the first on-device inference engine tailored for mixture-of-experts (MoE) LLMs, a popular family of sparse LLMs whose computational cost stays nearly constant as parameter size scales. EdgeMoE achieves both memory and computational efficiency by partitioning the model across the storage hierarchy: non-expert weights are held in the device's memory, while expert weights are kept in external storage and fetched into memory only when they are activated. This design rests on a crucial insight: expert weights, though voluminous, are infrequently accessed because of sparse activation patterns. To further reduce the overhead of expert I/O swapping, EdgeMoE incorporates two techniques: (1) expert-wise bitwidth adaptation, which reduces the size of expert weights at an acceptable loss of accuracy; and (2) expert management, which predicts in advance which experts will be activated and preloads them into the compute-I/O pipeline. In empirical evaluations on well-established MoE LLMs and various edge devices, EdgeMoE demonstrates substantial memory savings and performance improvements over competitive baseline solutions.
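
The storage-hierarchy partitioning and expert preloading described above can be illustrated with a small sketch. It is not EdgeMoE's implementation: the class `ExpertStore`, its method names, the pickle-file layout, the LRU eviction policy, and the single background I/O thread are all assumptions made for this example, since the abstract does not specify them. The sketch keeps expert weights on disk, holds a bounded number of them in memory, and lets a caller start fetching a predicted expert before it is needed.

```python
import os
import pickle
import tempfile
from collections import OrderedDict
from concurrent.futures import ThreadPoolExecutor


class ExpertStore:
    """Hypothetical store: experts live on disk, a few are cached in memory."""

    def __init__(self, directory, capacity):
        self.directory = directory
        self.capacity = capacity        # max experts resident in memory
        self.cache = OrderedDict()      # (layer, expert) -> weights, LRU order
        self.pending = {}               # (layer, expert) -> in-flight Future
        self.blocking_loads = 0         # fetches the caller had to wait for
        self.pool = ThreadPoolExecutor(max_workers=1)  # background I/O thread

    def _path(self, key):
        layer, expert = key
        return os.path.join(self.directory, f"layer{layer}_expert{expert}.pkl")

    def save(self, key, weights):
        """Write one expert's weights to external storage."""
        with open(self._path(key), "wb") as f:
            pickle.dump(weights, f)

    def _read(self, key):
        with open(self._path(key), "rb") as f:
            return pickle.load(f)

    def preload(self, key):
        """Start fetching a predicted expert in the background."""
        if key not in self.cache and key not in self.pending:
            self.pending[key] = self.pool.submit(self._read, key)

    def get(self, key):
        """Return expert weights, reading from storage only on a cache miss."""
        if key in self.cache:
            self.cache.move_to_end(key)         # mark as most recently used
            return self.cache[key]
        future = self.pending.pop(key, None)
        if future is not None:
            weights = future.result()           # preload was already issued
        else:
            self.blocking_loads += 1            # cold start or misprediction
            weights = self._read(key)
        self.cache[key] = weights
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)      # evict least recently used
        return weights


def run_demo():
    """Exercise a cold load, a preloaded load, and a cache hit."""
    with tempfile.TemporaryDirectory() as directory:
        store = ExpertStore(directory, capacity=2)
        for expert in range(4):
            store.save((0, expert), [float(expert)] * 3)
        first = store.get((0, 1))       # cold: blocks on storage
        store.preload((0, 3))           # pretend a predictor chose expert 3
        second = store.get((0, 3))      # served by the background fetch
        again = store.get((0, 1))       # still resident in the cache
        store.pool.shutdown()
        return first, second, again, store.blocking_loads
```

In this sketch only the first access counts as a blocking load; the preloaded expert and the cached expert are returned without an extra synchronous read. A real engine would also need a predictor to choose which experts to preload and would discard preloads that turn out to be wrong, both of which are omitted here.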
