Compute-in-memory (CIM) based neural network accelerators offer a promising solution to the von Neumann bottleneck by computing directly within memory arrays. However, SRAM CIM faces limitations in executing larger models due to its cell size and on-chip memory constraints. This work proposes CIMPool, a CIM-aware compression and acceleration framework that counters this limitation through a weight sharing-based compression technique, aptly named `Weight Pool,' enabling significantly larger neural networks to fit within on-chip memory constraints. This method minimizes the accuracy trade-off typically associated with parameter compression, allowing CIMPool to achieve a significantly larger compression ratio than traditional quantization at iso-accuracy. Furthermore, CIMPool co-optimizes the compression algorithm, hardware, and dataflow to efficiently implement the hardware permutation required by weight pool compression, with negligible area and throughput overhead. Empirical results demonstrate that CIMPool can achieve 8-bit level accuracy with an effective 0.5-bit precision, reduce chip area by 62.3% for ResNet-18, and enable the execution of an order of magnitude larger models for a given area budget in SRAM CIMs. When DRAM is used to store weights, CIMPool reduces total energy by 3.24x compared to iso-accuracy traditional CIMs.
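To make the weight sharing idea concrete, the following is a minimal, hypothetical sketch of weight-pool compression: weights are grouped into fixed-length vectors, and each vector is replaced by the index of its nearest entry in a small shared pool. With an 8-bit index over 16-weight vectors, storage is 8/16 = 0.5 bits per weight, matching the effective precision quoted in the abstract. The pool here is random for illustration; the paper's actual pool construction, training procedure, and hardware mapping are not shown.

```python
import random

VECTOR_LEN = 16   # weights are grouped into vectors of 16 (assumed grouping)
POOL_BITS = 8     # an 8-bit index selects one of 256 shared vectors
POOL_SIZE = 1 << POOL_BITS

random.seed(0)
# Hypothetical shared weight pool; in practice it would be learned/clustered.
pool = [[random.gauss(0.0, 1.0) for _ in range(VECTOR_LEN)]
        for _ in range(POOL_SIZE)]

def compress(weights):
    """Replace each length-16 weight vector by the index of its
    nearest pool entry (squared Euclidean distance)."""
    assert len(weights) % VECTOR_LEN == 0
    indices = []
    for i in range(0, len(weights), VECTOR_LEN):
        vec = weights[i:i + VECTOR_LEN]
        best = min(range(POOL_SIZE),
                   key=lambda j: sum((a - b) ** 2
                                     for a, b in zip(vec, pool[j])))
        indices.append(best)
    return indices

def decompress(indices):
    """Expand pool indices back into a flat weight list."""
    out = []
    for j in indices:
        out.extend(pool[j])
    return out

# 64 example weights -> 4 pool indices.
weights = [random.gauss(0.0, 1.0) for _ in range(64)]
idx = compress(weights)
# Effective precision: 8 index bits per 16 weights = 0.5 bits/weight.
print(len(idx), POOL_BITS / VECTOR_LEN)
```

Decompression is a pure table lookup, which is why the on-chip cost reduces to storing the small pool plus per-vector indices, rather than the full-precision weights.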