Deep neural networks (DNNs) are widely deployed in many fields. Owing to the in-situ computation (processing-in-memory) capability of the Resistive Random Access Memory (ReRAM) crossbar, ReRAM-based accelerators show potential for accelerating DNNs with low power and high performance. However, despite this power advantage, such accelerators suffer from the high power consumption of their peripheral circuits, especially the Analog-to-Digital Converters (ADCs), which account for over 60 percent of total power consumption. This problem prevents ReRAM-based accelerators from achieving higher efficiency. We observe that some Analog-to-Digital conversion operations are redundant: they do not contribute to maintaining inference accuracy and can be eliminated by modifying the ADC search logic. Based on these observations, we propose an algorithm-hardware co-design method that spans both hardware design and quantization algorithms. First, we analyze the output distribution along the crossbar's bit-lines and identify fine-grained redundant ADC sampling bits. To further compress the ADC bits, we propose a hardware-friendly quantization method and coding scheme in which different quantization strategies are applied to partial results in different intervals. To support these two features, we propose a lightweight architectural design based on the successive-approximation-register ADC (SAR-ADC). Notably, our method is not only more energy efficient but also retains the flexibility of the algorithm. Experiments demonstrate that our method reduces ADC power by about $1.6\times$ to $2.3\times$.
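To make the idea of eliminating redundant conversion steps concrete, the following is a minimal sketch of an ideal SAR binary search in which some bit decisions are skipped. It is an illustration only, not the search logic proposed in this work: the function name `sar_convert`, the parameters `skip_msb` and `skip_lsb`, and the use of the comparison count as a rough proxy for conversion energy are all assumptions made for this example.

```python
def sar_convert(value, n_bits=8, skip_msb=0, skip_lsb=0):
    """Ideal SAR search on an integer input; returns (code, comparisons).

    Hypothetical model for illustration:
    - value is assumed to lie in [0, 2**n_bits).
    - skip_msb upper bits are assumed to be zero and are never tested
      (e.g. because the bit-line output is known to stay in a low range).
    - skip_lsb lower bits are left unresolved, i.e. coarser quantization.
    """
    code = 0
    comparisons = 0
    # Test bits from the highest non-skipped bit down to the lowest one.
    for bit in range(n_bits - 1 - skip_msb, skip_lsb - 1, -1):
        trial = code | (1 << bit)
        comparisons += 1          # one comparator decision per tested bit
        if value >= trial:
            code = trial          # keep the bit if the input is at least the trial level
    return code, comparisons


if __name__ == "__main__":
    print(sar_convert(37))                           # full 8-step search
    print(sar_convert(37, skip_msb=2))               # two upper bits skipped
    print(sar_convert(37, skip_msb=2, skip_lsb=2))   # also two lower bits skipped
```

In this toy model, an input of 37 is resolved exactly with 8 comparisons by the full search, exactly with 6 comparisons when the two upper bits are known to be zero, and to the coarser code 36 with 4 comparisons when the two lower bits are also dropped. Inputs that exceed the reduced range simply saturate at the largest representable code.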