Large language model (LLM) decoding is a major inference bottleneck because its low arithmetic intensity makes performance highly sensitive to memory bandwidth. 3D-stacked near-memory processing (NMP) provides substantially higher local memory bandwidth than conventional off-chip interfaces, making it a promising substrate for decode acceleration. However, our analysis shows that this bandwidth advantage also shifts many decode operators on 3D-stacked NMP back into the compute-bound regime. Under the tight area budget of the logic die, the design of the compute substrate itself becomes a first-order challenge. We therefore rethink the compute microarchitecture of prior 3D-stacked NMP designs. First, we replace the MAC-tree-based compute units of prior designs with a more area-efficient systolic array. We further observe that decode operators exhibit substantial shape diversity, which makes reconfigurability in both systolic array shape and dataflow essential for sustaining high utilization. Building on this insight, we exploit two key opportunities: the high local memory bandwidth reduces the need for large on-chip buffers, and the existing vector core, originally designed to handle auxiliary tensor computations, already provides much of the control logic and multi-ported buffering that a fine-grained, flexible systolic array requires. This allows us to unify the two structures in a highly area-efficient manner. Putting these together, we present the first compute microarchitecture tailored to LLM decoding on 3D-stacked NMP, explicitly designed to satisfy the joint requirements of low area cost, high-bandwidth operation, and fine-grained reconfigurability. We further propose a multi-core scheduling framework. Compared with Stratum, our design achieves an average 2.91x speedup and 2.40x higher energy efficiency across both dense and MoE models.
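The claim that higher bandwidth can push decode operators back into the compute-bound regime can be illustrated with a standard roofline check. The sketch below is not taken from the paper: the platform names, peak throughputs, bandwidths, and operator shapes are hypothetical values chosen only to show the effect, and the byte count assumes each operand is read or written exactly once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Platform:
    """A roofline platform; all numbers used below are hypothetical."""
    name: str
    peak_flops: float     # peak compute throughput, FLOP/s
    mem_bandwidth: float  # peak memory bandwidth, bytes/s

    @property
    def ridge_point(self) -> float:
        # Arithmetic intensity (FLOP/byte) at which compute and memory balance.
        return self.peak_flops / self.mem_bandwidth


def gemm_intensity(batch: int, k: int, n: int, bytes_per_elem: int = 2) -> float:
    """Arithmetic intensity of a (batch x k) @ (k x n) decode projection.

    Assumes weights, inputs, and outputs each move through memory once.
    """
    flops = 2 * batch * k * n
    bytes_moved = bytes_per_elem * (k * n + batch * k + batch * n)
    return flops / bytes_moved


def regime(platform: Platform, intensity: float) -> str:
    """Classify an operator as compute-bound or memory-bound on a platform."""
    return "compute-bound" if intensity >= platform.ridge_point else "memory-bound"


# Hypothetical platforms: the stacked one has 10x the bandwidth but,
# because of the logic-die area budget, half the peak compute.
OFF_CHIP = Platform("off-chip (assumed)", peak_flops=100e12, mem_bandwidth=1e12)
STACKED = Platform("3D-stacked NMP (assumed)", peak_flops=50e12, mem_bandwidth=10e12)

if __name__ == "__main__":
    for batch in (1, 16):
        ai = gemm_intensity(batch, k=4096, n=4096)
        for platform in (OFF_CHIP, STACKED):
            print(f"batch={batch:2d} AI={ai:6.2f} FLOP/B  "
                  f"{platform.name}: {regime(platform, ai)}")
```

Under these assumed numbers, the same batched projection is memory-bound on the off-chip platform and compute-bound on the stacked one, while a batch-1 projection stays memory-bound on both, which is consistent with "many", not all, operators shifting regime.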