NeuPIMs: NPU-PIM Heterogeneous Acceleration for Batched LLM Inferencing

Modern transformer-based Large Language Models (LLMs) are constructed from a series of decoder blocks. Each block comprises three key components: (1) QKV generation, (2) multi-head attention, and (3) feed-forward networks. In batched processing, QKV generation and feed-forward networks involve compute-intensive matrix-matrix multiplications (GEMM), while multi-head attention requires bandwidth-heavy matrix-vector multiplications (GEMV). Machine learning accelerators such as TPUs and NPUs are proficient at GEMM but less efficient at GEMV. Conversely, Processing-in-Memory (PIM) technology is tailored for efficient GEMV computation but lacks the computational power to handle GEMM effectively. Inspired by this insight, we propose NeuPIMs, a heterogeneous accelerator-based system that jointly exploits a conventional GEMM-focused NPU and GEMV-optimized PIM devices. The main challenge in efficiently integrating NPU and PIM lies in enabling concurrent operation of both platforms, each handling a specific kernel type. First, existing PIMs typically operate in a "blocked" mode, in which only one of the NPU and the PIM can be active at any given time. Second, the inherent dependencies between GEMM and GEMV in LLMs restrict their parallel processing. To tackle these challenges, NeuPIMs is equipped with dual row buffers in each bank, so that regular memory read/write operations and PIM commands can be handled simultaneously. Further, NeuPIMs employs a runtime sub-batch interleaving technique to maximize concurrent execution, leveraging batch parallelism to allow two independent sub-batches to be pipelined within a single NeuPIMs node. Our evaluation demonstrates that, compared to an NPU-only approach and a naive NPU-PIM integrated system, NeuPIMs achieves 2.3× and 1.6× throughput improvement, respectively.
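The GEMM/GEMV contrast drawn above can be made concrete with a back-of-the-envelope arithmetic-intensity estimate (FLOPs per byte moved). The sketch below is not taken from the paper: the function names, the 2-byte (fp16) element size, and the example dimensions (hidden size 4096, head dimension 128, context length 2048) are illustrative assumptions, and the traffic model counts each operand only once.

```python
# Rough arithmetic-intensity model (FLOPs per byte of memory traffic).
# All names and dimensions here are illustrative assumptions, not values
# from the NeuPIMs paper.

def gemm_intensity(batch, d_in, d_out, bytes_per_elem=2):
    """Batched projection (batch x d_in) @ (d_in x d_out).

    The weight matrix is read once and shared by every request in the
    batch, so intensity grows with the batch size.
    """
    flops = 2 * batch * d_in * d_out
    bytes_moved = bytes_per_elem * (d_in * d_out + batch * d_in + batch * d_out)
    return flops / bytes_moved


def attention_gemv_intensity(batch, seq_len, d_head, bytes_per_elem=2):
    """Decode-phase attention scores q @ K^T for one head.

    Each request reads its own cached K matrix (seq_len x d_head), so
    nothing is shared across the batch and intensity does not improve.
    """
    flops = 2 * batch * seq_len * d_head
    bytes_moved = bytes_per_elem * batch * (seq_len * d_head + d_head + seq_len)
    return flops / bytes_moved


if __name__ == "__main__":
    for batch in (1, 8, 64):
        g = gemm_intensity(batch, 4096, 4096)
        a = attention_gemv_intensity(batch, 2048, 128)
        print(f"batch={batch:3d}  GEMM: {g:6.2f} FLOPs/B  attention GEMV: {a:5.2f} FLOPs/B")
```

Under these assumptions, batching raises the intensity of the weight GEMMs roughly in proportion to the batch size, while the attention GEMV stays below 1 FLOP per byte at any batch size. This is the asymmetry that motivates assigning GEMM to the NPU and GEMV to the PIM.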
