Mixture-of-Experts (MoE) models have become the dominant architecture for large-scale language models, yet on-premises serving remains fundamentally memory-bound because batching turns sparse per-token compute into dense memory activation. Memory-centric architectures (PIM, NMP) improve bandwidth but leave compute underutilized under MoE's low arithmetic intensity at high batch sizes. Speculative decoding (SD) trades idle compute for fewer target invocations, yet verification must load experts even for rejected tokens, which severely limits its benefit in MoE, especially at low batch sizes. We propose ELMoE-3D, a hybrid-bonding (HB)-based HW-SW co-designed framework that unifies cache-based acceleration and speculative decoding to deliver speedup across batch sizes. We identify two intrinsic elasticity axes of MoE, expert and bit, and jointly scale them to construct Elastic Self-Speculative Decoding (Elastic-SD), which serves as both an expert cache and a strongly aligned self-draft model accelerated by high HB bandwidth. Our LSB-augmented bit-sliced architecture exploits inherent redundancy in bit-slice representations to natively support bit-nested execution. On our 3D-stacked hardware, ELMoE-3D achieves an average $6.6\times$ speedup and $4.4\times$ energy efficiency gain over naive MoE serving on xPU across batch sizes 1-16, and delivers $2.2\times$ speedup and $1.4\times$ energy efficiency gain over the best-performing prior accelerator baseline.
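The claim that batching turns sparse per-token compute into dense memory activation can be illustrated with a small back-of-the-envelope calculation. The sketch below is not taken from the paper: it assumes, purely for illustration, that each token selects `top_k` distinct experts uniformly at random and independently of other tokens, and the 64-expert, top-8 configuration is a hypothetical example. Under that assumption, the expected number of distinct experts whose weights must be loaded for one layer is $E\,(1 - (1 - k/E)^{B})$ for $E$ experts, top-$k$ routing, and batch size $B$.

```python
def expected_active_experts(num_experts: int, top_k: int, batch_size: int) -> float:
    """Expected number of distinct experts touched by one batch in one MoE layer.

    Illustrative assumption (not from the paper): every token picks `top_k`
    distinct experts uniformly at random, independently of the other tokens.
    """
    # Probability that a single token does NOT route to a given expert.
    p_miss_one_token = 1.0 - top_k / num_experts
    # An expert stays cold only if every token in the batch misses it;
    # linearity of expectation then gives the expected count of warm experts.
    return num_experts * (1.0 - p_miss_one_token ** batch_size)


if __name__ == "__main__":
    # Hypothetical configuration: 64 experts, top-8 routing.
    for b in (1, 2, 4, 8, 16):
        active = expected_active_experts(num_experts=64, top_k=8, batch_size=b)
        print(f"batch={b:2d}  expected active experts={active:5.1f} / 64")
```

Under these assumptions a single token touches only `top_k` experts, while a batch of 16 already touches most of the layer, so the weight traffic per decoding step approaches that of a dense model even though the per-token compute stays sparse. Real routers are not uniform, so the actual numbers would differ.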