Analog Compute-in-Memory (CiM) accelerators are increasingly recognized for their efficiency in accelerating Deep Neural Networks (DNNs). However, their dependence on Analog-to-Digital Converters (ADCs) for accumulating partial sums from crossbars leads to substantial power and area overhead. Moreover, the high area overhead of ADCs limits the number of ADCs that can be integrated per crossbar, which in turn constrains throughput. One way to mitigate this issue is to adopt extreme low-precision quantization (binary or ternary) for partial sums. Training DNNs with such quantization eliminates the need for ADCs. While this strategy effectively reduces ADC costs, it introduces the challenge of managing numerous floating-point scale factors, which are trainable parameters like DNN weights. These scale factors must be multiplied with the binary or ternary outputs at the columns of the crossbar to ensure system accuracy. To this end, we propose an algorithm-hardware co-design approach in which DNNs are first trained with quantization-aware training. We then introduce HCiM, an ADC-Less Hybrid Analog-Digital CiM accelerator. HCiM uses analog CiM crossbars to perform Matrix-Vector Multiplication operations, coupled with a digital CiM array dedicated to processing scale factors. This digital CiM array can execute both addition and subtraction operations within the memory array, thus enhancing processing speed. Additionally, it exploits the inherent sparsity in ternary quantization to achieve further energy savings. Compared to an analog CiM baseline architecture using 7-bit and 4-bit ADCs, HCiM achieves energy reductions of up to 28% and 12%, respectively.
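The abstract notes that scale factors must be multiplied with ternary partial sums, and that the digital CiM array handles this with in-memory addition and subtraction while exploiting ternary sparsity. The sketch below illustrates why that works arithmetically: multiplying a scale factor by a value in {-1, 0, +1} reduces to adding it, subtracting it, or skipping it. This is a minimal software illustration, not the HCiM hardware or its training procedure; the function names `ternarize` and `accumulate_scaled`, the symmetric threshold, and the example values are all hypothetical.

```python
def ternarize(partial_sum, threshold):
    """Map an analog partial sum to -1, 0, or +1.

    Assumption: a simple symmetric threshold stands in for the
    comparator behavior; the real quantizer is learned during training.
    """
    if partial_sum > threshold:
        return 1
    if partial_sum < -threshold:
        return -1
    return 0


def accumulate_scaled(ternary_sums, scale_factors):
    """Compute sum(t * s) using only additions and subtractions.

    Zero-valued ternary sums are skipped entirely, which is where the
    sparsity-related savings would come from. Returns the accumulated
    value and the number of add/subtract operations actually performed.
    """
    if len(ternary_sums) != len(scale_factors):
        raise ValueError("each ternary sum needs exactly one scale factor")
    acc = 0.0
    ops = 0
    for t, s in zip(ternary_sums, scale_factors):
        if t == 0:
            continue  # sparse entry: no operation needed
        acc = acc + s if t > 0 else acc - s
        ops += 1
    return acc, ops


if __name__ == "__main__":
    # Hypothetical analog partial sums and scale factors.
    analog = [3.2, -0.4, -5.1, 0.1]
    scales = [0.5, 0.25, 0.125, 1.0]
    ternary = [ternarize(p, threshold=1.0) for p in analog]
    total, ops = accumulate_scaled(ternary, scales)
    print(ternary, total, ops)
```

In this example two of the four ternary values are zero, so only two add/subtract operations are performed instead of four multiplications.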