3D point cloud neural networks have significantly enhanced the perceptual capabilities of resource-limited mobile intelligent systems. However, point cloud algorithms incur substantial memory access during data preprocessing and impose a heavy workload on feature computing, resulting in high energy consumption and latency. In this paper, an efficient SRAM-based computing-in-memory (SRAM-CIM) accelerator, PC2IM, is proposed to alleviate memory access bottlenecks in point-based 3D point cloud networks. A data preprocessing module driven by customized CIM engines is proposed and incorporated into a memory-efficient data flow. Specifically, an approximate distance SRAM-CIM (APD-CIM) is introduced to eliminate repetitive on-chip memory access for point clouds that are spatially partitioned by the median and to reduce the volume of temporary distance data. Building on the APD-CIM, a two-level Ping-Pong-MAX Content Addressable Memory (Ping-Pong-MAX CAM) is introduced to adaptively update temporary distances and perform an in-situ search for the maximum, further reducing memory access. Additionally, an efficient CIM-based feature computing engine, named split-concatenate SRAM-CIM, is presented to minimize computation latency in multi-layer perceptrons with high-precision inputs while maintaining high area and energy efficiency. Experimental results show that PC2IM achieves a 1.5x speedup and 2.7x higher energy efficiency compared to a state-of-the-art point cloud accelerator. Moreover, PC2IM achieves a 3.5x speedup and 1518.9x higher energy efficiency compared to GPU implementations.
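The preprocessing step that keeps per-point temporary distances and repeatedly searches for their maximum can be illustrated with a small software sketch. It assumes that this step is farthest point sampling, as commonly used in point-based networks; the abstract does not name the algorithm. The Manhattan (L1) metric is only a stand-in for the approximate distance, and the function names `manhattan` and `farthest_point_sampling` are hypothetical. The sketch shows the data flow in software and does not model the APD-CIM or Ping-Pong-MAX CAM hardware.

```python
def manhattan(p, q):
    """L1 distance, used here as an assumed stand-in for an approximate distance."""
    return sum(abs(a - b) for a, b in zip(p, q))


def farthest_point_sampling(points, num_samples, start=0):
    """Return indices of num_samples points, each chosen as the point
    farthest from all previously selected points."""
    n = len(points)
    if not 0 < num_samples <= n:
        raise ValueError("num_samples must be between 1 and len(points)")

    # Temporary distance: distance from each point to its nearest selected point.
    temp_dist = [float("inf")] * n
    selected = [start]

    for _ in range(num_samples - 1):
        last = points[selected[-1]]
        # Update step: keep the smaller of the stored and the new distance.
        for i, p in enumerate(points):
            d = manhattan(p, last)
            if d < temp_dist[i]:
                temp_dist[i] = d
        # Max search: the next sample is the point with the largest temporary distance.
        selected.append(max(range(n), key=temp_dist.__getitem__))

    return selected


if __name__ == "__main__":
    cloud = [(0, 0, 0), (1, 0, 0), (10, 0, 0), (5, 0, 0)]
    print(farthest_point_sampling(cloud, 3))
```

In this software form, every iteration reads all points and all temporary distances again. That repeated traffic is the kind of memory access the abstract describes as the bottleneck.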