Edge processing of deep neural networks (DNNs) is becoming increasingly important because it extracts valuable information directly at the data source, minimizing latency and energy consumption. Frequency-domain model compression, such as with the Walsh-Hadamard transform (WHT), has been identified as an efficient alternative. However, the benefits of frequency-domain processing are often offset by the additional multiply-accumulate (MAC) operations it requires. This paper proposes a novel approach to energy-efficient acceleration of frequency-domain neural networks that uses analog-domain frequency-based tensor transformations. The approach offers several high-level advantages for computational efficiency: an array micro-architecture with parallelism, ADC/DAC-free analog computations, and increased output sparsity. It achieves more compact cells by eliminating the need for trainable parameters in the transformation matrix. Moreover, the array micro-architecture enables adaptive stitching of cells column-wise and row-wise, thereby facilitating perfect parallelism in computations. Additionally, the scheme enables ADC/DAC-free computations by training against highly quantized matrix-vector products, leveraging the parameter-free nature of the matrix multiplications. Another crucial aspect of the design is its ability to handle signed-bit processing for frequency-based transformations, which leads to increased output sparsity and a reduced digitization workload. On a 16$\times$16 crossbar with 8-bit input processing, the proposed approach achieves an energy efficiency of 1602 tera operations per second per Watt (TOPS/W) without an early termination strategy and 5311 TOPS/W with an early termination strategy at VDD = 0.8 V.
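To make the "parameter-free" property concrete, the following is a minimal software sketch of a 16-point Walsh-Hadamard transform followed by a coarse sign quantization of the resulting matrix-vector product. It is not the paper's analog circuit or training procedure: the function names `hadamard`, `wht`, and `sign_quantize`, the Sylvester construction, the input vector, and the choice of a ternary sign quantizer are all assumptions made for illustration only.

```python
import numpy as np


def hadamard(n: int) -> np.ndarray:
    """Build the n x n Hadamard matrix by the Sylvester construction.

    Every entry is +1 or -1, so the matrix holds no trainable parameters.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.array([[1]], dtype=np.int64)
    while h.shape[0] < n:
        # Double the size: [[H, H], [H, -H]]
        h = np.block([[h, h], [h, -h]])
    return h


def wht(x: np.ndarray) -> np.ndarray:
    """Unnormalized WHT of a 1-D vector whose length is a power of two."""
    return hadamard(x.shape[0]) @ x


def sign_quantize(y: np.ndarray) -> np.ndarray:
    """Illustrative coarse quantizer (an assumption): map each output to -1, 0, or +1."""
    return np.sign(y).astype(np.int64)


# Hypothetical signed input vector sized for a 16 x 16 array.
x = np.arange(16, dtype=np.int64) - 8
y = wht(x)            # full-precision transform output
q = sign_quantize(y)  # highly quantized matrix-vector product

print("transform output:", y)
print("quantized output:", q)
print("nonzero outputs:", np.count_nonzero(q), "of", q.size)
```

Because the transform matrix contains only +1 and -1, the product reduces to signed additions, and applying the unnormalized transform twice returns the input scaled by the vector length. For this smooth input most transform coefficients are exactly zero, which gives a small picture of the kind of output sparsity the abstract refers to.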