NeuPIMs: A NPU-PIM Heterogeneous Acceleration for Batched Inference of Large Language Model

Modern transformer-based Large Language Models (LLMs) are built from a series of decoder blocks. Each block comprises three key components: (1) QKV generation, (2) multi-head attention, and (3) feed-forward networks. In batched processing, QKV generation and the feed-forward networks involve compute-intensive matrix-matrix multiplications (GEMM), while multi-head attention requires bandwidth-heavy matrix-vector multiplications (GEMV). Machine learning accelerators such as TPUs and NPUs handle GEMM well but are less efficient for GEMV. Conversely, Processing-in-Memory (PIM) technology is tailored for efficient GEMV computation but lacks the computational power to handle GEMM effectively. Building on this observation, we propose NeuPIMs, a heterogeneous acceleration system that jointly exploits a conventional GEMM-focused NPU and GEMV-optimized PIM devices. The main challenge in integrating NPU and PIM efficiently lies in enabling concurrent operation of both platforms, each handling its own kernel type. First, existing PIMs typically operate in a "blocked" mode, in which only one of the NPU and the PIM can be active at any given time. Second, the inherent dependencies between GEMM and GEMV in LLMs restrict their parallel processing. To tackle these challenges, NeuPIMs equips each bank with dual row buffers, so that regular memory read/write operations and PIM commands can be handled simultaneously. Further, NeuPIMs employs a runtime sub-batch interleaving technique to maximize concurrent execution, leveraging batch parallelism to pipeline two independent sub-batches within a single NeuPIMs node. Our evaluation demonstrates that, compared to an NPU-only approach and a naïve NPU-PIM integrated system, NeuPIMs achieves 2.3× and 1.6× throughput improvement, respectively.
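The GEMM/GEMV split described above can be illustrated with a rough arithmetic-intensity estimate (FLOPs per byte moved). The sketch below is not taken from the paper: the function names, the 2-byte (FP16) element size, and the model dimensions are assumptions chosen for illustration, and only the decode-phase attention score computation (query times cached keys) is modeled. It shows that QKV generation reuses one weight matrix across the whole batch, so its intensity grows with batch size, whereas attention reads a separate key cache per request, so its intensity stays near 1 FLOP/byte regardless of batch size.

```python
# Rough arithmetic-intensity model (FLOPs per byte of memory traffic).
# All names and default sizes are illustrative assumptions, not the paper's.

def qkv_gemm_intensity(batch, d_model, bytes_per_elem=2):
    """QKV generation: (batch x d_model) @ (d_model x 3*d_model).

    The weight matrix is read once and shared by every request in the batch.
    """
    flops = 2 * batch * d_model * 3 * d_model
    weight_bytes = d_model * 3 * d_model * bytes_per_elem
    activation_bytes = (batch * d_model + batch * 3 * d_model) * bytes_per_elem
    return flops / (weight_bytes + activation_bytes)


def attention_score_intensity(batch, seq_len, d_model, n_heads=32,
                              bytes_per_elem=2):
    """Decode-phase attention scores: one query vector times the cached keys.

    Each request reads its own key cache, so nothing is shared across the batch.
    """
    flops = 2 * batch * seq_len * d_model
    key_bytes = batch * seq_len * d_model * bytes_per_elem
    query_bytes = batch * d_model * bytes_per_elem
    score_bytes = batch * n_heads * seq_len * bytes_per_elem
    return flops / (key_bytes + query_bytes + score_bytes)


if __name__ == "__main__":
    for batch in (1, 16, 64):
        gemm = qkv_gemm_intensity(batch, d_model=4096)
        gemv = attention_score_intensity(batch, seq_len=1024, d_model=4096)
        print(f"batch={batch:3d}  QKV GEMM: {gemm:6.1f} FLOP/B  "
              f"attention: {gemv:4.2f} FLOP/B")
```

Under these assumptions, batching turns QKV generation into a compute-bound kernel suited to an NPU, while attention remains bandwidth-bound at any batch size, which is the case PIM targets.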
