The substantial memory-bandwidth and compute demands of large language models (LLMs) pose critical challenges for efficient inference. To address them, prior work has explored heterogeneous systems that combine neural processing units (NPUs) with DRAM-based processing-in-memory (PIM) for LLM acceleration. However, high-precision PIM compute units incur significant area and power overhead in DRAM technology, which limits the effective computation throughput. In this paper, we introduce P3-LLM, a novel NPU-PIM integrated accelerator for edge LLM inference. Our approach is threefold. First, we propose a flexible mixed-precision quantization scheme that leverages hybrid numerical formats to quantize different LLM operands with high compression efficiency and minimal accuracy loss. Second, we architect an efficient PIM accelerator for P3-LLM, featuring enhanced compute units that support hybrid numerical formats. Our careful choice of numerical formats allows us to co-design low-precision PIM compute units that significantly boost computation throughput under iso-area constraints. Third, we optimize the low-precision dataflow of different LLM modules by applying operator fusion to minimize the overhead of runtime dequantization. Evaluations on diverse LLMs and tasks demonstrate that P3-LLM achieves higher accuracy than state-of-the-art KV-cache quantization and weight-activation quantization algorithms. By combining the proposed quantization scheme with the co-designed low-precision PIM architecture, P3-LLM achieves average speedups of $4.9\times$, $2.0\times$, and $3.4\times$ over the state-of-the-art LLM accelerators HBM-PIM, Ecco, and Pimba, respectively. Code is available at https://github.com/yc2367/P3-LLM.