Analog compute-in-memory (CIM) in static random-access memory (SRAM) is a promising way to accelerate deep learning inference, since it circumvents the memory wall and exploits ultra-efficient low-precision analog arithmetic. Recent analog CIM designs adopt bit-parallel schemes for multi-bit analog matrix-vector multiplication (MVM). Compared with conventional bit-serial methods, which digitally shift and add multiple partial analog results, these schemes aim for higher energy efficiency and throughput, as well as simpler and more robust training. However, bit-parallel operation requires more complex analog computation and therefore aggravates well-known analog CIM challenges: large cell area, inefficient and inaccurate multi-bit analog operations, and vulnerability to process, voltage, and temperature (PVT) variations. This paper presents PICO-RAM, a PVT-insensitive and compact CIM SRAM macro with charge-domain bit-parallel computation. It adopts a multi-bit thin-cell multiply-accumulate (MAC) unit that shares the same transistor layout as the most compact 6T SRAM cell. All analog computing modules, including the digital-to-analog converters (DACs), MAC units, analog shift-and-add, and analog-to-digital converters (ADCs), reuse one set of local capacitors inside the array, performing in-situ computation to save area and improve accuracy. A compact 8.5-bit dual-threshold time-domain ADC power-gates the main path most of the time, significantly reducing energy. Among SRAM-based analog CIM designs, our 65-nm prototype achieves the highest weight storage density, 559 Kb/mm$^{2}$, and exceptional robustness to temperature and voltage variations ($-40$ to $105\,^{\circ}$C and 0.65 to 1.2 V).
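To make the bit-serial versus bit-parallel distinction concrete, the following Python sketch models an idealized, noise-free MVM with unsigned integer weights and inputs. It is an illustration under stated assumptions, not PICO-RAM's circuit: the function names are hypothetical, each analog MAC is modeled as an exact dot product, and the charge-sharing helper assumes a generic binary-weighted capacitor bank (capacitor for bit b sized 2^b C) rather than the paper's actual capacitor arrangement.

```python
import random
from fractions import Fraction


def dot(row, vec):
    """Exact dot product standing in for one idealized analog MAC."""
    return sum(w * x for w, x in zip(row, vec))


def bit_serial_mvm(weights, inputs, in_bits):
    """Bit-serial: apply one input bit-plane per cycle, then shift-and-add digitally."""
    out = []
    for row in weights:
        acc = 0
        for b in range(in_bits):
            bit_plane = [(x >> b) & 1 for x in inputs]  # b-th bit of every input
            partial = dot(row, bit_plane)               # one (idealized) analog MAC cycle
            acc += partial << b                         # digital shift by bit significance
        out.append(acc)
    return out


def bit_parallel_mvm(weights, inputs):
    """Bit-parallel: apply full multi-bit inputs at once (e.g., via a DAC) in one MAC."""
    return [dot(row, inputs) for row in weights]


def charge_share_shift_add(partial_voltages):
    """Hypothetical analog shift-and-add by binary-weighted charge sharing.

    Assumes capacitor b has size 2**b * C and is precharged to partial_voltages[b];
    shorting all capacitors yields the capacitance-weighted average voltage.
    """
    num = sum(Fraction(2 ** b) * v for b, v in enumerate(partial_voltages))
    den = sum(2 ** b for b in range(len(partial_voltages)))
    return num / den


if __name__ == "__main__":
    random.seed(0)
    W = [[random.randrange(16) for _ in range(8)] for _ in range(4)]  # 4-bit weights
    x = [random.randrange(16) for _ in range(8)]                      # 4-bit inputs
    # Both schemes give the same ideal result; they differ in cycle count and in
    # whether the shift-and-add happens digitally or in the analog domain.
    assert bit_serial_mvm(W, x, in_bits=4) == bit_parallel_mvm(W, x)
    print(bit_parallel_mvm(W, x))
```

In this model the bit-serial path needs `in_bits` MAC cycles plus a digital shift-and-add per output, while the bit-parallel path performs a single multi-bit operation. The charge-sharing helper returns the shift-and-add result scaled by 1/(2^n - 1), showing how binary-weighted capacitors can perform the weighting in the analog domain. In real hardware, the single bit-parallel analog result must resolve more levels, which is why it is more exposed to nonlinearity and PVT variation.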