To deploy large Mixture-of-Experts (MoE) models cost-effectively, offloading-based single-GPU heterogeneous inference is crucial. While GPU-CPU architectures that offload cold experts are constrained by host memory bandwidth, emerging GPU-NDP architectures utilize DIMM-NDP to offload non-hot experts. However, non-hot experts are not a homogeneous memory-bound group: a significant subset of warm experts exists that is severely penalized by high GPU I/O latency yet can saturate NDP compute throughput, exposing a critical compute gap. We present TriMoE, a novel GPU-CPU-NDP architecture that fills this gap by synergistically leveraging an AMX-enabled CPU to precisely map hot, warm, and cold experts onto their optimal compute units. We further introduce a bottleneck-aware expert scheduling policy and a prediction-driven dynamic relayout/rebalancing scheme. Experiments demonstrate that TriMoE achieves up to 2.83x speedup over state-of-the-art solutions.
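The hot/warm/cold tiering described above can be illustrated with a minimal sketch. All names, tier cut-offs, and the frequency-ranking heuristic here are illustrative assumptions, not TriMoE's actual placement algorithm:

```python
def map_experts(activation_counts, hot_frac=0.1, warm_frac=0.3):
    """Assign each expert to a compute unit by activation frequency.

    activation_counts: dict mapping expert_id -> observed activation count.
    hot_frac / warm_frac: hypothetical tier cut-offs, tunable per deployment.
    """
    ranked = sorted(activation_counts, key=activation_counts.get, reverse=True)
    n = len(ranked)
    n_hot = max(1, int(n * hot_frac))
    n_warm = int(n * warm_frac)
    placement = {}
    for i, eid in enumerate(ranked):
        if i < n_hot:
            placement[eid] = "GPU"        # hot: kept resident in GPU HBM
        elif i < n_hot + n_warm:
            placement[eid] = "CPU_AMX"    # warm: compute-heavy, served by AMX CPU
        else:
            placement[eid] = "DIMM_NDP"   # cold: memory-bound, offloaded to NDP
    return placement

# Example: 10 experts with decreasing activation counts.
placement = map_experts({i: 10 - i for i in range(10)})
```

In a real system this static ranking would be replaced by the paper's bottleneck-aware scheduling and prediction-driven rebalancing, which adapt placement as activation patterns drift.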