ReDy: A Novel ReRAM-centric Dynamic Quantization Approach for Energy-efficient CNN Inference

The primary operation in deep neural networks (DNNs) is the dot product of quantized input activations and weights. Prior works have proposed memory-centric architectures based on the Processing-In-Memory (PIM) paradigm. Resistive RAM (ReRAM) technology is especially appealing for PIM-based DNN accelerators due to its high density for storing weights, low leakage energy, low read latency, and ability to perform DNN dot products massively in parallel within the ReRAM crossbars. However, the main bottleneck of these architectures is the energy-hungry analog-to-digital (A/D) conversions required to perform analog computations in ReRAM, which erode the efficiency and performance benefits of PIM. To improve the energy efficiency of in-ReRAM analog dot-product computations, we present ReDy, a hardware accelerator that implements a ReRAM-centric Dynamic quantization scheme to take advantage of the bit-serial streaming and processing of activations. The energy consumption of ReRAM-based DNN accelerators is directly proportional to the numerical precision of the input activations of each DNN layer. In particular, ReDy exploits the fact that activations of CONV layers in Convolutional Neural Networks (CNNs), a subset of DNNs, are commonly grouped according to the size of their filters and the size of the ReRAM crossbars. ReDy then quantizes each group of activations on the fly with a different numerical precision, based on a novel heuristic that takes into account the statistical distribution of each group. Overall, ReDy greatly reduces the activity of the ReRAM crossbars and the number of A/D conversions compared to a static 8-bit uniform quantization. We evaluate ReDy on a popular set of modern CNNs. On average, ReDy provides 13% energy savings over an ISAAC-like accelerator with negligible accuracy loss and area overhead.
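To illustrate why per-group precision matters for a bit-serial design, the following is a minimal sketch, not ReDy's actual heuristic, which the abstract does not specify. It assumes unsigned 8-bit activations, a hypothetical group size standing in for the number of crossbar rows, and a deliberately simple stand-in rule (`pick_bits`) that keeps only as many bits as the largest value in the group needs. All function names are illustrative. The number of streamed bits serves as a rough proxy for crossbar activations and A/D conversions.

```python
from typing import List, Sequence, Tuple

BASELINE_BITS = 8  # static 8-bit uniform quantization used as the baseline


def split_into_groups(activations: Sequence[int], group_size: int) -> List[List[int]]:
    """Split a flat list of activations into consecutive groups.

    group_size is a hypothetical stand-in for the number of crossbar rows.
    """
    return [list(activations[i:i + group_size])
            for i in range(0, len(activations), group_size)]


def pick_bits(group: Sequence[int], max_bits: int = BASELINE_BITS) -> int:
    """Stand-in heuristic (an assumption, NOT the paper's method):
    use the smallest bit width that can represent the group's largest value."""
    peak = max(group)
    return min(max_bits, max(1, peak.bit_length()))


def bit_serial_cycles(activations: Sequence[int], group_size: int) -> Tuple[int, int]:
    """Return (dynamic, static) counts of bit-serial streaming steps.

    Each step streams one bit of every activation in a group, so the count
    scales with the precision chosen for that group.
    """
    groups = split_into_groups(activations, group_size)
    dynamic = sum(pick_bits(g) for g in groups)
    static = BASELINE_BITS * len(groups)
    return dynamic, static


if __name__ == "__main__":
    acts = [3, 0, 1, 2, 200, 17, 4, 0]
    dynamic, static = bit_serial_cycles(acts, group_size=4)
    print(f"dynamic steps: {dynamic}, static 8-bit steps: {static}")
```

In this toy input, the first group holds only small values and needs 2 bits, while the second needs the full 8, so the dynamic scheme streams 10 bits instead of 16. A heuristic based on the statistical distribution of each group, as the abstract describes, could also choose to drop low-order bits and trade a small amount of accuracy for further savings; this sketch does not model that.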
