Edge computing is a promising solution for handling high-dimensional, multispectral analog data from sensors and IoT devices in applications such as autonomous drones. However, the limited storage and computing resources of edge devices make it challenging to perform complex predictive modeling at the edge. Compute-in-memory (CiM) has emerged as a principal paradigm for minimizing the energy of deep learning-based inference at the edge. Nevertheless, integrating storage and processing complicates the memory cells and/or memory peripherals, essentially trading off area efficiency for energy efficiency. This paper proposes a novel solution to improve area efficiency in deep learning inference tasks. The proposed method employs two key strategies. First, a frequency-domain learning approach uses binarized Walsh-Hadamard transforms, reducing the number of DNN parameters (by 87% in MobileNetV2) and enabling compute-in-SRAM, which better utilizes parallelism during inference. Second, a memory-immersed collaborative digitization method among CiM arrays reduces the area overhead of conventional ADCs. This allows more CiM arrays to fit in limited-footprint designs, leading to better parallelism and fewer external memory accesses. Different networking configurations are explored, in which Flash, successive approximation (SA), and hybrid digitization steps can be implemented using the memory-immersed scheme. The results are demonstrated on a 65 nm CMOS test chip, which exhibits significant area and energy savings compared to a 40 nm-node 5-bit SAR ADC and a 5-bit Flash ADC. By processing analog data more efficiently, it is possible to selectively retain valuable data from sensors and alleviate the challenges posed by the analog data deluge.
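To make the first strategy concrete, the following is a minimal sketch of a plain fast Walsh-Hadamard transform, not the paper's binarized layer or its in-SRAM implementation. It only illustrates why such a transform suits compute-in-memory: the transform matrix contains only +1 and -1 entries, so it needs additions and subtractions but no stored weights and no multiplications. The function name `fwht` and the unnormalized, natural (Hadamard) ordering are assumptions made for this illustration.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform in natural (Hadamard) order.

    Illustrative only: the input length must be a power of two, and the
    butterfly uses nothing but additions and subtractions.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("input length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # Combine blocks of size h into blocks of size 2h.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


if __name__ == "__main__":
    x = [3, -1, 4, 1]
    y = fwht(x)
    # Applying the unnormalized transform twice scales the input by n.
    print(y, fwht(y))
```

Because the transform itself has no trainable parameters, replacing learned layers with it is one way a network's parameter count can shrink; how the paper arranges and trains such layers is not shown here.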
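The two digitization styles named in the abstract can be contrasted with an idealized behavioral model. The sketch below is an assumption-laden illustration, not the memory-immersed circuit: it models an ideal 5-bit converter with a hypothetical reference voltage `v_ref`, and only counts comparisons. A Flash conversion resolves all thresholds in one step using 2^bits - 1 comparisons, whereas a successive-approximation conversion uses one comparison per bit over `bits` sequential steps.

```python
def sar_digitize(v_in, v_ref=1.0, bits=5):
    """Ideal successive-approximation conversion: binary search, one comparison per bit."""
    code = 0
    comparisons = 0
    for bit in reversed(range(bits)):
        trial = code | (1 << bit)              # tentatively set the next bit
        threshold = trial * v_ref / (1 << bits)
        comparisons += 1
        if v_in >= threshold:                  # keep the bit if the input is at or above the threshold
            code = trial
    return code, comparisons


def flash_digitize(v_in, v_ref=1.0, bits=5):
    """Ideal Flash conversion: compare against every threshold at once (thermometer code)."""
    levels = 1 << bits
    thermometer = [v_in >= k * v_ref / levels for k in range(1, levels)]
    return sum(thermometer), len(thermometer)


if __name__ == "__main__":
    for v in (0.3, 0.5):
        print(v, sar_digitize(v), flash_digitize(v))
```

Both models return the same output code for the same input; they differ in how many comparisons are needed and whether those comparisons happen in parallel or in sequence, which is the trade-off a hybrid scheme sits between.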