Edge deployment of low-batch large language models (LLMs) faces critical memory bandwidth bottlenecks when executing memory-intensive general matrix-vector multiplication (GEMV) operations. While digital processing-in-memory (PIM) architectures promise to accelerate GEMV operations, existing PIM-equipped edge devices still suffer from three key limitations: limited bandwidth improvement, component under-utilization in mixed workloads, and low compute capacity of computing units (CUs). In this paper, we propose CD-PIM to address these challenges through four key innovations. First, we introduce a high-bandwidth compute-efficient mode (HBCEM) that enhances bandwidth by dividing each bank into four pseudo-banks through segmented global bitlines. Second, we propose a low-batch interleaving mode (LBIM) that improves component utilization by overlapping GEMV operations with GEMM operations. Third, we design a compute-efficient CU that performs enhanced GEMV operations in a pipelined manner by serially feeding weight data into the computing core. Fourth, we adopt column-wise mapping for the key-cache matrix and row-wise mapping for the value-cache matrix, which fully utilizes CU resources. Our evaluation shows that, compared to a GPU-only baseline and state-of-the-art PIM designs, CD-PIM achieves average speedups of 11.42x and 4.25x, respectively, for a single batch in HBCEM. Moreover, for low batch sizes, CD-PIM achieves an average speedup of 1.12x in LBIM compared to HBCEM.