Mixture of Experts (MoE) architectures significantly enhance the capacity of LLMs without proportional increases in computation, but at the cost of a vast parameter size. Offloading MoE expert parameters to host memory and leveraging both CPU and GPU computation has recently emerged as a promising direction to support such models on resource-constrained local PC platforms. While promising, we notice that existing approaches are mismatched with the dynamic nature of expert workloads, which leads to three fundamental inefficiencies: (1) Static expert assignment causes severe CPU-GPU load imbalance, underutilizing CPU and GPU resources; (2) Existing prefetching techniques fail to accurately predict high-workload experts, leading to costly inaccurate prefetches; (3) GPU cache policies neglect workload dynamics, resulting in poor hit rates and limited effectiveness. To address these challenges, we propose DALI, a workloaD-Aware offLoadIng framework for efficient MoE inference on local PCs. To fully utilize hardware resources, DALI first dynamically assigns experts to the CPU or GPU by modeling assignment as a 0-1 integer optimization problem and solving it efficiently with a Greedy Assignment strategy at runtime. To improve prefetching accuracy, we develop a Residual-Based Prefetching method that leverages inter-layer residual information to accurately predict high-workload experts. Additionally, we introduce a Workload-Aware Cache Replacement policy that exploits temporal correlation in expert activations to improve GPU cache efficiency. Evaluated across various MoE models and settings, DALI achieves significant speedups in both the prefill and decoding phases over state-of-the-art offloading frameworks.
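To illustrate the dynamic assignment idea, the following is a minimal sketch of a greedy heuristic for the 0-1 expert-assignment problem: each expert goes to whichever device keeps the running finish time smaller, subject to a GPU memory budget. The cost model (per-token latencies `cpu_cost`/`gpu_cost`, capacities, and the function name itself) is an assumption for illustration, not DALI's actual formulation.

```python
# Hypothetical greedy solver for CPU/GPU expert assignment.
# Assumed cost model: each expert processes `workloads[i]` tokens at a
# fixed per-token latency on CPU or GPU; the GPU holds at most
# `gpu_capacity` experts. These parameters are illustrative only.

def greedy_assign(workloads, cpu_cost=5.0, gpu_cost=1.0,
                  gpu_capacity=4, expert_size=1):
    """Return (assignment, makespan): each expert index -> 'cpu' or 'gpu'."""
    # Consider heavy experts first, so they get first claim on the GPU.
    order = sorted(range(len(workloads)), key=lambda i: -workloads[i])
    assignment = {}
    cpu_time = gpu_time = 0.0
    gpu_used = 0
    for i in order:
        t_cpu = cpu_time + workloads[i] * cpu_cost
        t_gpu = gpu_time + workloads[i] * gpu_cost
        # Place the expert where the running finish time stays smaller,
        # as long as GPU memory permits.
        if gpu_used + expert_size <= gpu_capacity and t_gpu <= t_cpu:
            assignment[i] = 'gpu'
            gpu_time = t_gpu
            gpu_used += expert_size
        else:
            assignment[i] = 'cpu'
            cpu_time = t_cpu
    # Makespan: CPU and GPU work in parallel, so latency is the max.
    return assignment, max(cpu_time, gpu_time)
```

A greedy pass like this runs in O(E log E) per layer for E experts, which is why it is practical to re-solve the assignment at runtime as workloads shift, instead of committing to a static split.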