Block-Wise Mixed-Precision Quantization: Enabling High Efficiency for Practical ReRAM-based DNN Accelerators

Xueying Wu, Edward Hanson, Nansu Wang, Qilin Zheng, Xiaoxuan Yang, Huanrui Yang, Shiyu Li, Feng Cheng, Partha Pratim Pande, Janardhan Rao Doppa, Krishnendu Chakrabarty, Hai Li

From arXiv; 12 pages, 13 figures

Resistive random access memory (ReRAM)-based processing-in-memory (PIM) architectures have demonstrated great potential to accelerate Deep Neural Network (DNN) training and inference. However, the computational accuracy of analog PIM is compromised by non-idealities such as the conductance variation of ReRAM cells. The impact of these non-idealities worsens as the number of concurrently activated wordlines and bitlines increases. To guarantee computational accuracy, only a limited number of wordlines and bitlines of the crossbar array can be turned on concurrently, which significantly reduces the achievable parallelism of the architecture. While these constraints on parallelism limit the efficiency of the accelerators, they also provide a new opportunity for fine-grained mixed-precision quantization. To enable efficient DNN inference on practical ReRAM-based accelerators, we propose an algorithm-architecture co-design framework called Block-Wise mixed-precision Quantization (BWQ). At the algorithm level, BWQ-A introduces a mixed-precision quantization scheme at the block level, which achieves high weight and activation compression ratios with negligible accuracy degradation. At the hardware level, we present the architecture design BWQ-H, which leverages the low-bit-width models produced by BWQ-A to perform high-efficiency DNN inference on ReRAM devices. BWQ-H also adopts a novel precision-aware weight mapping method to increase the ReRAM crossbar's throughput. Our evaluation demonstrates the effectiveness of BWQ, which achieves a 6.08x speedup and a 17.47x energy saving on average compared to existing ReRAM-based architectures.
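
To make the idea of block-level mixed precision concrete, the following is a minimal Python sketch and not the BWQ-A algorithm itself, since the abstract does not describe how bit-widths are chosen. It assumes that a block corresponds to the group of weights under the concurrently activated wordlines and bitlines, and it uses a hypothetical greedy rule: each block gets the smallest candidate bit-width whose quantization error stays within a tolerance. The names `quantize_block`, `blockwise_quantize`, `tol`, and the candidate bit-widths are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def quantize_block(block: Matrix, bits: int) -> Matrix:
    """Symmetric uniform quantization of one block; 0 bits means the block is pruned."""
    max_abs = max(abs(w) for row in block for w in row)
    if bits == 0 or max_abs == 0.0:
        return [[0.0 for _ in row] for row in block]
    levels = 2 ** (bits - 1) - 1          # signed integer range, requires bits >= 2
    scale = max_abs / levels
    return [[round(w / scale) * scale for w in row] for row in block]


def max_abs_error(a: Matrix, b: Matrix) -> float:
    """Largest element-wise absolute difference between two equally shaped matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def blockwise_quantize(
    weights: Matrix,
    block_rows: int,
    block_cols: int,
    tol: float,
    candidates: Sequence[int] = (0, 2, 4, 8),  # assumed bit-width choices, ascending
) -> Tuple[Matrix, List[List[int]]]:
    """Assign each block the smallest candidate bit-width with error <= tol.

    This greedy rule is an illustrative assumption, not the method of the paper.
    If no candidate meets the tolerance, the largest candidate is used.
    """
    n_rows, n_cols = len(weights), len(weights[0])
    quantized = [[0.0] * n_cols for _ in range(n_rows)]
    bit_map: List[List[int]] = []
    for r0 in range(0, n_rows, block_rows):
        bit_row: List[int] = []
        for c0 in range(0, n_cols, block_cols):
            block = [row[c0:c0 + block_cols] for row in weights[r0:r0 + block_rows]]
            for bits in candidates:
                q = quantize_block(block, bits)
                if max_abs_error(block, q) <= tol:
                    break
            for i, q_row in enumerate(q):
                quantized[r0 + i][c0:c0 + len(q_row)] = q_row
            bit_row.append(bits)
        bit_map.append(bit_row)
    return quantized, bit_map


# Toy 4x4 weight matrix split into 2x2 blocks.
weights = [
    [0.01, -0.02, 0.50, -0.50],
    [0.00, 0.03, -0.50, 0.50],
    [0.90, 0.10, 0.00, 0.00],
    [-0.35, 0.64, 0.00, 0.00],
]
quantized, bit_map = blockwise_quantize(weights, 2, 2, tol=0.05)
print(bit_map)  # [[0, 2], [4, 0]]
```

In this toy case the near-zero and all-zero blocks are dropped entirely, the block with values of ±0.5 needs only 2 bits, and only the block with a wider spread of values needs 4 bits. A per-block bit-width map of this kind is the sort of information a precision-aware weight mapping could use, though how BWQ-H actually maps weights is not specified in the abstract.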
