ELMoE-3D: Leveraging Intrinsic Elasticity of MoE for Hybrid-Bonding-Enabled Self-Speculative Decoding in On-Premises Serving

Mixture-of-Experts (MoE) models have become the dominant architecture for large-scale language models, yet on-premises serving remains fundamentally memory-bound because batching turns sparse per-token compute into dense memory activation. Memory-centric architectures (PIM, NMP) improve bandwidth but leave compute underutilized under MoE's low arithmetic intensity at high batch sizes. Speculative decoding (SD) trades idle compute for fewer target invocations, yet verification must load experts even for rejected tokens, which severely limits its benefit in MoE, especially at low batch sizes. We propose ELMoE-3D, a hybrid-bonding (HB)-based HW-SW co-designed framework that unifies cache-based acceleration and speculative decoding to deliver speedup across batch sizes. We identify two intrinsic elasticity axes of MoE, expert and bit, and jointly scale them to construct Elastic Self-Speculative Decoding (Elastic-SD), which serves as both an expert cache and a strongly aligned self-draft model accelerated by high HB bandwidth. Our LSB-augmented bit-sliced architecture exploits inherent redundancy in bit-slice representations to natively support bit-nested execution. On our 3D-stacked hardware, ELMoE-3D achieves an average $6.6\times$ speedup and $4.4\times$ energy efficiency gain over naive MoE serving on xPU across batch sizes 1 to 16, and delivers a $2.2\times$ speedup and $1.4\times$ energy efficiency gain over the best-performing prior accelerator baseline.
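
The claim that batching turns sparse per-token compute into dense memory activation can be illustrated with a back-of-the-envelope estimate. The sketch below is not taken from the paper: it assumes each token independently selects `top_k` distinct experts uniformly at random, which real routers do not do exactly, and the configuration of 64 experts with top-8 routing is a hypothetical example. Under that assumption, the expected fraction of experts touched by a batch of $B$ tokens is $1 - (1 - k/E)^B$.

```python
def expected_active_fraction(num_experts: int, top_k: int, batch_size: int) -> float:
    """Expected fraction of experts in one MoE layer touched by a batch.

    Assumption (illustrative only): every token picks `top_k` distinct
    experts uniformly at random, independently of the other tokens.
    """
    if not 0 < top_k <= num_experts:
        raise ValueError("top_k must be in the range 1..num_experts")
    if batch_size < 0:
        raise ValueError("batch_size must be non-negative")
    # Probability that one token does NOT select a given expert.
    miss_probability = 1.0 - top_k / num_experts
    # The expert stays idle only if every token in the batch misses it.
    return 1.0 - miss_probability ** batch_size


if __name__ == "__main__":
    # Hypothetical configuration: 64 experts, top-8 routing.
    for batch_size in (1, 2, 4, 8, 16):
        fraction = expected_active_fraction(64, 8, batch_size)
        print(f"batch={batch_size:2d}  expected active fraction={fraction:.3f}")
```

Under these assumptions, a single token touches only `top_k / num_experts` of the expert weights, while a modest batch already touches most of them, so the weight traffic per decoding step approaches that of a dense model.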
