Block-Wise Mixed-Precision Quantization: Enabling High Efficiency for Practical ReRAM-based DNN Accelerators

Xueying Wu, Edward Hanson, Nansu Wang, Qilin Zheng, Xiaoxuan Yang, Huanrui Yang, Shiyu Li, Feng Cheng, Partha Pratim Pande, Janardhan Rao Doppa, Krishnendu Chakrabarty, Hai Li

from arXiv, 12 pages, 13 figures

Resistive random access memory (ReRAM)-based processing-in-memory (PIM) architectures have demonstrated great potential to accelerate deep neural network (DNN) training and inference. However, the computational accuracy of analog PIM is compromised by non-idealities such as the conductance variation of ReRAM cells. The impact of these non-idealities worsens as the number of concurrently activated wordlines and bitlines increases. To guarantee computational accuracy, only a limited number of wordlines and bitlines of the crossbar array can be turned on concurrently, which significantly reduces the achievable parallelism of the architecture. While these constraints on parallelism limit the efficiency of the accelerators, they also provide a new opportunity for fine-grained mixed-precision quantization. To enable efficient DNN inference on practical ReRAM-based accelerators, we propose an algorithm-architecture co-design framework called Block-Wise mixed-precision Quantization (BWQ). At the algorithm level, BWQ-A introduces a mixed-precision quantization scheme at the block level, which achieves a high weight and activation compression ratio with negligible accuracy degradation. At the hardware level, we present the architecture design BWQ-H, which leverages the low-bit-width models produced by BWQ-A to perform high-efficiency DNN inference on ReRAM devices. BWQ-H also adopts a novel precision-aware weight mapping method to increase the throughput of the ReRAM crossbar. Our evaluation demonstrates the effectiveness of BWQ, which achieves a 6.08x speedup and a 17.47x energy saving on average compared to existing ReRAM-based architectures.
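
To make the idea of block-level mixed precision concrete, the following is a minimal Python sketch, not the BWQ-A algorithm itself. It assumes that a weight matrix is partitioned into fixed-size blocks (for example, matching the number of wordlines and bitlines that can be activated together), that each block's bit width is supplied by the caller, and that each block is quantized with a simple symmetric uniform quantizer. The function names, the `bits_per_block` parameter, and the quantizer are all illustrative assumptions; the abstract does not specify how BWQ-A selects per-block precision.

```python
from typing import List

Matrix = List[List[float]]


def quantize_block(block: Matrix, bits: int) -> Matrix:
    """Symmetric uniform quantization of one block (illustrative assumption).

    bits == 0 prunes the block entirely; otherwise bits must be >= 2.
    """
    if bits == 0:
        return [[0.0 for _ in row] for row in block]
    if bits < 2:
        raise ValueError("this sketch supports 0 bits or at least 2 bits")
    max_abs = max(abs(w) for row in block for w in row)
    if max_abs == 0.0:
        return [[0.0 for _ in row] for row in block]
    levels = 2 ** (bits - 1) - 1      # largest signed integer level
    scale = max_abs / levels          # per-block scaling factor
    return [[round(w / scale) * scale for w in row] for row in block]


def blockwise_quantize(weights: Matrix, block_rows: int, block_cols: int,
                       bits_per_block: List[List[int]]) -> Matrix:
    """Quantize each (block_rows x block_cols) tile with its own bit width.

    bits_per_block[i][j] is the hypothetical bit width of the tile in
    block-row i and block-column j.
    """
    n_rows, n_cols = len(weights), len(weights[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for bi, r0 in enumerate(range(0, n_rows, block_rows)):
        for bj, c0 in enumerate(range(0, n_cols, block_cols)):
            # Extract the tile, quantize it, and write it back in place.
            block = [row[c0:c0 + block_cols]
                     for row in weights[r0:r0 + block_rows]]
            q_block = quantize_block(block, bits_per_block[bi][bj])
            for i, q_row in enumerate(q_block):
                out[r0 + i][c0:c0 + len(q_row)] = q_row
    return out


if __name__ == "__main__":
    w = [[0.5, -1.0, 0.3, 0.2],
         [0.25, 0.75, -0.1, 0.4],
         [1.0, -0.2, 0.6, -0.6],
         [0.4, 0.9, 0.1, 0.3]]
    # Four 2x2 blocks with different (hypothetical) precisions.
    for row in blockwise_quantize(w, 2, 2, [[8, 0], [2, 4]]):
        print(row)
```

In this sketch, a block assigned 0 bits is dropped altogether, while neighboring blocks keep higher precision. That is the kind of fine-grained flexibility that per-layer or per-channel quantization cannot express.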
