Crossbar-based in-memory computing (IMC) has emerged as a promising platform for hardware acceleration of deep neural networks (DNNs). However, the energy and latency of IMC systems are dominated by the large overhead of the peripheral analog-to-digital converters (ADCs). To address this ADC bottleneck, we propose stochastic processing of array-level partial sums (PS) for efficient IMC. By leveraging the probabilistic switching of spin-orbit torque magnetic tunnel junctions (SOT-MTJs), the proposed PS processing eliminates the costly ADCs, yielding significant improvements in energy and area efficiency. To mitigate the accompanying accuracy loss, we develop PS-quantization-aware training that enables backward propagation through the stochastic PS. Furthermore, we propose a novel scheme with inhomogeneous sampling lengths for the stochastic conversion. When running ResNet-20 on the CIFAR-10 dataset, our architecture-to-algorithm co-design demonstrates up to 16x, 8x, and 10x improvements in energy, latency, and area, respectively, compared to IMC with standard ADCs. Our optimized design configuration using stochastic PS achieves a 130x (24x) improvement in energy-delay product compared to IMC with full-precision ADCs (sparse low-bit ADCs), while maintaining near-software accuracy on various benchmark classification tasks.
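To make the core idea concrete, here is a minimal PyTorch sketch under our own illustrative assumptions: the sigmoid probability mapping, the sampling length of 16, and the straight-through backward pass are placeholders, not the paper's exact circuit model or training rule. Each analog partial sum is mapped to an MTJ switching probability, digitized as the mean of a number of Bernoulli samples, and made differentiable for end-to-end training.

```python
import torch

class StochasticPS(torch.autograd.Function):
    """Illustrative stochastic partial-sum (PS) conversion.

    Forward: map an analog PS to a switching probability (a sigmoid
    here, standing in for the SOT-MTJ switching curve) and estimate
    it as the mean of n_samples Bernoulli trials.
    Backward: straight-through estimator, a common surrogate that
    lets gradients cross the stochastic stage during training.
    """

    @staticmethod
    def forward(ctx, ps, n_samples):
        p = torch.sigmoid(ps)  # assumed mapping from analog PS to Pr[switch]
        # Monte Carlo digitization: average of n_samples binary switching events.
        trials = (torch.rand(n_samples, *p.shape) < p).float()
        return trials.mean(dim=0)

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through: treat the stochastic stage as identity w.r.t. ps.
        return grad_output, None

# Usage sketch: 16 stochastic read-outs per partial sum.
ps = torch.randn(4, requires_grad=True)
out = StochasticPS.apply(ps, 16)
out.sum().backward()  # gradients flow via the straight-through pass
```

A longer sampling length lowers the variance of the stochastic PS estimate at the cost of latency, which is the trade-off that motivates assigning inhomogeneous sampling lengths across the conversion.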