Mixture-of-Experts (MoE) has become a dominant architecture for scaling large language models (LLMs). However, the execution characteristics of MoE inference are changing rapidly and increasingly mismatch the assumptions underlying existing Processing-in-Memory (PIM) systems. Prior PIM systems for LLMs rely on static rules to offload memory-bound operations to PIM and do not account for the combined effects of load imbalance and inter-GPU communication. Meanwhile, modern MoE models activate fewer experts out of an increasingly large pool, which produces a bimodal token-to-expert distribution: a small set of experts receives many tokens, while a long tail of experts receives only one or a few. We identify this trend toward increasingly bimodal token-to-expert distributions, quantify the resulting disparity in arithmetic intensity across experts, and show that this disparity dramatically reduces the efficiency of state-of-the-art PIM systems for LLMs. To address this problem, we propose a scheduler for serving MoE models on multi-GPU systems with attached HBM-PIM stacks. The scheduler partitions expert execution between GPU and PIM based on runtime token-to-expert distributions, while jointly considering interconnect overhead, memory bandwidth, GPU throughput, and PIM throughput. We further propose Sieve, a runtime framework that uses the scheduler to coordinate execution across GPUs and their attached HBM-PIM stacks. Sieve overlaps GPU computation, PIM computation, and intra- and inter-device communication while preserving the cross-device dependencies induced by expert parallelism. We evaluate Sieve on our cycle-accurate simulator, which is based on Ramulator 2.0. Compared to state-of-the-art PIM systems for MoE, Sieve improves both throughput and interactivity by 1.3x, 1.3x, and 1.6x on Qwen3.5-397B-A17B, GPT-OSS-120B, and Qwen3-30B-A3B, respectively.
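To make the partitioning idea concrete, the following is a minimal sketch of how a token-count-aware GPU/PIM split could look. It is not Sieve's scheduler: the roofline-style cost model, the threshold sweep, the names `Device`, `expert_time`, and `partition`, and all hardware numbers are assumptions chosen for illustration. The sketch also ignores interconnect overhead and inter-GPU communication, which the scheduler described above considers jointly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Hypothetical device model: peak compute and peak memory bandwidth."""
    flops_per_s: float
    bytes_per_s: float


def _flops_and_bytes(tokens, d_model, d_ff, bytes_per_param=2):
    # Assumes a gated FFN expert with three d_model x d_ff weight matrices.
    params = 3 * d_model * d_ff
    flops = 2 * tokens * params                            # one multiply-add per weight per token
    weight_bytes = params * bytes_per_param                # weights are read once
    act_bytes = tokens * 2 * d_model * bytes_per_param     # input and output activations
    return flops, weight_bytes + act_bytes


def arithmetic_intensity(tokens, d_model, d_ff):
    """FLOPs per byte moved for one expert processing `tokens` tokens."""
    flops, nbytes = _flops_and_bytes(tokens, d_model, d_ff)
    return flops / nbytes


def expert_time(tokens, d_model, d_ff, dev):
    """Roofline estimate: the slower of the compute and memory phases."""
    if tokens == 0:
        return 0.0  # an expert with no routed tokens is skipped
    flops, nbytes = _flops_and_bytes(tokens, d_model, d_ff)
    return max(flops / dev.flops_per_s, nbytes / dev.bytes_per_s)


def partition(token_counts, d_model, d_ff, gpu, pim):
    """Send the k hottest experts to the GPU and the rest to PIM.

    Sweeps every k and keeps the split with the smallest makespan,
    assuming GPU and PIM execute their experts concurrently.
    """
    order = sorted(range(len(token_counts)), key=lambda e: -token_counts[e])
    best = None
    for k in range(len(order) + 1):
        gpu_ids, pim_ids = order[:k], order[k:]
        t_gpu = sum(expert_time(token_counts[e], d_model, d_ff, gpu) for e in gpu_ids)
        t_pim = sum(expert_time(token_counts[e], d_model, d_ff, pim) for e in pim_ids)
        makespan = max(t_gpu, t_pim)
        if best is None or makespan < best[2]:
            best = (gpu_ids, pim_ids, makespan)
    return best


# Illustrative bimodal batch: two hot experts and a long tail of cold ones.
counts = [512, 480, 1, 2, 1, 0, 3, 1]
gpu = Device(flops_per_s=1e15, bytes_per_s=3e12)     # made-up GPU numbers
pim = Device(flops_per_s=2e13, bytes_per_s=1.2e13)   # made-up PIM numbers
gpu_ids, pim_ids, makespan = partition(counts, 2048, 768, gpu, pim)
print(gpu_ids, pim_ids, makespan)
```

Under these assumed numbers, a single-token expert has an arithmetic intensity near 1 FLOP per byte and is bound by weight reads, so the higher-bandwidth PIM device finishes it sooner. A 512-token expert is compute-bound and runs far faster on the GPU. A static "offload all experts" rule would ignore that difference, which is the mismatch the abstract describes.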