General matrix-vector multiplication (GeMV) remains a critical latency bottleneck in large language model (LLM) inference, even with quantized low-bit models. Processing-Using-DRAM (PUD), an analog in-DRAM computing technique, has the potential to repurpose on-device DRAM as a GeMV engine, offering additional high-throughput processing capabilities to widespread consumer devices without DRAM modifications. However, applying PUD to GeMV operations in the LLM inference pipeline incurs significant overheads $\textit{before}$ and $\textit{after}$ in-DRAM computation, diminishing the benefits of its high-throughput processing capabilities. This paper presents MVDRAM, the first practical system to accelerate GeMV operations for low-bit LLM inference using unmodified DRAM. By leveraging the data-sharing patterns and mathematical linearity of GeMV operations, MVDRAM orchestrates the processor and DRAM to eliminate the costs of pre-arranging inputs and bit-transposing outputs that conventional PUD approaches require. Our experimental evaluation with four DDR4 DRAM modules shows that MVDRAM achieves inference speed comparable to or better than processor-based implementations for GeMV operations in low-bit (under 4-bit) LLMs. In particular, MVDRAM achieves up to 7.29$\times$ speedup and 30.5$\times$ energy efficiency improvement for low-bit GeMV operations. For end-to-end LLM inference, MVDRAM achieves 2.18$\times$ and 1.31$\times$ throughput improvements, along with 3.04$\times$ and 2.35$\times$ energy efficiency improvements, for 2-bit and 4-bit quantized models, respectively. MVDRAM has the potential to redefine the AI hardware landscape by demonstrating the feasibility of standard DRAM as an LLM accelerator.
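To make the linearity argument concrete, consider a minimal sketch in our own notation (the decomposition below, and the symbols $b$, $W_i$, $m$, and $n$, are illustrative assumptions rather than the paper's exact formulation). A $b$-bit quantized weight matrix $W \in \mathbb{Z}^{m \times n}$ can be written as a weighted sum of binary bit-planes $W_i \in \{0,1\}^{m \times n}$, and GeMV distributes over that sum:

$$W = \sum_{i=0}^{b-1} 2^i\, W_i \quad\Longrightarrow\quad Wx = \sum_{i=0}^{b-1} 2^i \left( W_i\, x \right).$$

Under this view, each binary partial product $W_i\, x$ can be produced by in-DRAM computation while the processor performs the inexpensive shift-and-add accumulation, sketching one way the bit-transposition of outputs could be sidestepped.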