Decoder-only Transformer models such as GPT have demonstrated superior performance in text generation by autoregressively predicting the next token. However, GPT performance is bounded by a low compute-to-memory ratio and heavy memory access. Throughput-oriented architectures such as GPUs target parallel processing rather than sequential token generation and are therefore inefficient for GPT acceleration, particularly for on-device inference applications. Processing-in-memory (PIM) architectures can significantly reduce data movement and provide high computation parallelism, making them promising candidates for accelerating GPT inference. In this work, we propose PIM-GPT, which aims to achieve high throughput, high energy efficiency, and end-to-end acceleration of GPT inference. PIM-GPT leverages DRAM-based PIM solutions to perform multiply-accumulate (MAC) operations on the DRAM chips, greatly reducing data movement. A compact application-specific integrated circuit (ASIC) is designed and synthesized to issue instructions to the PIM chips and to support data communication along with the necessary arithmetic computations. At the software level, the mapping scheme is designed to maximize data locality and computation parallelism by partitioning each matrix among DRAM channels and banks so that all in-bank computation resources are used concurrently. We develop an event-driven, clock-cycle-accurate simulator to validate the efficacy of the proposed PIM-GPT architecture. Overall, PIM-GPT achieves 41$-$137$\times$ and 631$-$1074$\times$ speedup and 339$-$1085$\times$ and 890$-$1632$\times$ higher energy efficiency over GPU and CPU baselines, respectively, on 8 GPT models with up to 1.4 billion parameters.
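The idea of partitioning a matrix among DRAM channels and banks so that every bank performs MAC operations on its own share can be illustrated with a small sketch. This is not the mapping scheme of PIM-GPT itself, whose details are not given in the abstract. It assumes a simple round-robin, row-wise assignment, and the names `partition_rows`, `bank_mac`, and `pim_matvec`, as well as the channel and bank counts, are hypothetical.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]
BankId = Tuple[int, int]  # (channel index, bank index)


def partition_rows(num_rows: int, num_channels: int,
                   banks_per_channel: int) -> Dict[BankId, List[int]]:
    """Assign matrix rows to (channel, bank) pairs in round-robin order.

    Assumption for illustration only: the real mapping may differ.
    """
    units = num_channels * banks_per_channel
    mapping: Dict[BankId, List[int]] = {}
    for row in range(num_rows):
        unit = row % units
        bank_id = (unit // banks_per_channel, unit % banks_per_channel)
        mapping.setdefault(bank_id, []).append(row)
    return mapping


def bank_mac(weights: Matrix, rows: List[int],
             x: List[float]) -> Dict[int, float]:
    """Multiply-accumulate done locally by one bank on the rows it stores."""
    return {r: sum(w * v for w, v in zip(weights[r], x)) for r in rows}


def pim_matvec(weights: Matrix, x: List[float],
               num_channels: int = 2,
               banks_per_channel: int = 4) -> List[float]:
    """Matrix-vector product assembled from per-bank partial results."""
    mapping = partition_rows(len(weights), num_channels, banks_per_channel)
    y: List[float] = [0.0] * len(weights)
    # In hardware the banks would work concurrently; here they run in turn.
    for rows in mapping.values():
        for r, acc in bank_mac(weights, rows, x).items():
            y[r] = acc
    return y
```

In this sketch only the input vector and the per-row results cross the bank boundary, while the weight rows stay where they are stored, which is the data-movement saving the abstract attributes to in-bank computation.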