Large language model (LLM) decoding is a major inference bottleneck because its low arithmetic intensity makes performance highly sensitive to memory bandwidth. 3D-stacked near-memory processing (NMP) provides substantially higher local memory bandwidth than conventional off-chip interfaces, making it a promising substrate for decode acceleration. However, our analysis shows that this bandwidth advantage also shifts many decode operators on 3D-stacked NMP back into the compute-bound regime. Under the tight area budget of the logic die, the design of the compute substrate itself therefore becomes a first-order challenge. Motivated by this, we rethink the compute microarchitecture of prior 3D-stacked NMP designs. First, we replace prior MAC tree-based compute units with a more area-efficient systolic array, and we further observe that decode operators exhibit substantial shape diversity, making reconfigurability in both systolic array shape and dataflow essential for sustaining high utilization. Building on this insight, we exploit two further opportunities: the high local memory bandwidth reduces the need for large on-chip buffers, and the existing vector core, originally designed to handle auxiliary tensor computations, already provides much of the control logic and multi-ported buffering required for fine-grained systolic-array flexibility, allowing us to unify the two structures in a highly area-efficient manner. Based on these insights, we present the first compute microarchitecture tailored to LLM decoding on 3D-stacked NMP, explicitly designed to satisfy the joint requirements of low area cost, high-bandwidth operation, and fine-grained reconfigurability. We further propose a multi-core scheduling framework. Compared with Stratum, our design achieves an average 2.91x speedup and 2.40x higher energy efficiency across both dense and MoE models.
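To illustrate why shape reconfigurability matters for utilization, the following sketch models a fixed-aspect-ratio systolic array versus one that can choose its aspect ratio per operator. The shapes, PE budget, and utilization model below are illustrative assumptions, not figures from the paper: decode-time GEMMs tend to have a tiny batch dimension, which strands rows of a square array.

```python
# Hypothetical sketch (not the paper's model): utilization of a fixed-shape
# vs. a shape-reconfigurable systolic array on decode-like operator shapes.
# PE budget, shapes, and the simplified tiling model are all assumptions.
import math

PES = 256  # assumed processing-element budget on the area-constrained logic die

def utilization(rows, cols, m, n):
    """Fraction of PEs doing useful work when an (m x n) output is
    tiled onto a (rows x cols) systolic array; edge tiles idle PEs."""
    row_tiles = math.ceil(m / rows)
    col_tiles = math.ceil(n / cols)
    useful = m * n
    occupied = row_tiles * rows * col_tiles * cols
    return useful / occupied

# Decode-time shapes: batch dimension m stays small (e.g. 1-8 tokens in flight)
shapes = [(1, 4096), (8, 4096), (4, 11008)]

fixed = [utilization(16, 16, m, n) for m, n in shapes]   # fixed 16x16 array
reconf = [max(utilization(r, PES // r, m, n)             # pick best aspect ratio
              for r in (1, 2, 4, 8, 16))
          for m, n in shapes]

for (m, n), f, r in zip(shapes, fixed, reconf):
    print(f"{m}x{n}: fixed util={f:.2f}, reconfigurable util={r:.2f}")
```

Under this toy model, the 1x4096 GEMV leaves a fixed 16x16 array ~94% idle, while flattening the same PE budget into a 1x256 array recovers full utilization; this is the kind of gap that motivates reconfigurability in both array shape and dataflow.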