Systolic Array (SA) architectures are well suited for accelerating matrix multiplications through the use of a pipelined array of Processing Elements (PEs) communicating with local connections and pre-orchestrated data movements. Even though most of the dynamic power consumption in SAs is due to multiplications and additions, pipelined data movement within the SA constitutes an additional important contributor. The goal of this work is to reduce the dynamic power consumption associated with the feeding of data to the SA, by synergistically applying bus-invert coding and zero-value clock gating. By exploiting salient attributes of state-of-the-art CNNs, such as the value distribution of the weights, the proposed SA applies appropriate encoding only to the data that exhibits high switching activity. Similarly, when one of the inputs is zero, unnecessary operations are entirely skipped. This selectively targeted, application-aware encoding approach is demonstrated to reduce the dynamic power consumption of data streaming in CNN applications using Bfloat16 arithmetic by 1%-19%. This translates to an overall dynamic power reduction of 6.2%-9.4%.
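To make the two techniques named above concrete, the following is a minimal software sketch of bus-invert coding and zero-value gating. It is an illustrative model under stated assumptions, not the hardware design evaluated in this work: the function names, the 8-bit `width` default, and the choice to encode the whole word are all hypothetical. The abstract does not say which bits of a Bfloat16 value are encoded, and real zero-value gating acts on pipeline register clocks rather than on a Python branch.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def bus_invert_encode(words, width=8):
    """Bus-invert coding: if sending a word would flip more than half of the
    bus lines, send its complement instead and raise an 'invert' flag.
    Returns a list of (bus_value, invert_flag) pairs.
    Assumption: the bus starts at all zeros."""
    mask = (1 << width) - 1
    prev_bus = 0
    encoded = []
    for w in words:
        w &= mask
        if hamming(prev_bus, w) > width // 2:
            bus, inv = (~w) & mask, 1
        else:
            bus, inv = w, 0
        encoded.append((bus, inv))
        prev_bus = bus
    return encoded


def bus_invert_decode(encoded, width=8):
    """Recover the original words by undoing the inversion where flagged."""
    mask = (1 << width) - 1
    return [((~bus) & mask) if inv else bus for bus, inv in encoded]


def count_toggles(values, start=0):
    """Total bit transitions when 'values' are driven in sequence onto lines
    that initially hold 'start' (a proxy for switching activity)."""
    toggles, prev = 0, start
    for v in values:
        toggles += hamming(prev, v)
        prev = v
    return toggles


def gated_mac(acc, a, b):
    """Software analogue of zero-value gating: when either operand is zero the
    product cannot change the accumulator, so the multiply-add is skipped."""
    if a == 0 or b == 0:
        return acc
    return acc + a * b


if __name__ == "__main__":
    words = [0x00, 0xFF, 0x00, 0xFF]  # worst case: every data line flips each cycle
    enc = bus_invert_encode(words)
    raw = count_toggles(words)
    coded = count_toggles([b for b, _ in enc]) + count_toggles([i for _, i in enc])
    print(f"raw toggles: {raw}, bus-invert toggles (incl. invert line): {coded}")
    assert bus_invert_decode(enc) == words
```

The toggle count for the encoded stream includes the extra invert line, since that line also switches and consumes power. This overhead is one reason encoding only the high-activity portion of the data, as the abstract describes, can be preferable to encoding everything.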