Systolic Array (SA) architectures are well suited for accelerating matrix multiplications through the use of a pipelined array of Processing Elements (PEs) communicating with local connections and pre-orchestrated data movements. Even though most of the dynamic power consumption in SAs is due to multiplications and additions, pipelined data movement within the SA constitutes an additional important contributor. The goal of this work is to reduce the dynamic power consumption associated with the feeding of data to the SA, by synergistically applying bus-invert coding and zero-value clock gating. By exploiting salient attributes of state-of-the-art CNNs, such as the value distribution of the weights, the proposed SA applies appropriate encoding only to the data that exhibits high switching activity. Similarly, when one of the inputs is zero, unnecessary operations are entirely skipped. This selectively targeted, application-aware encoding approach is demonstrated to reduce the dynamic power consumption of data streaming in CNN applications using Bfloat16 arithmetic by 1%-19%. This translates to an overall dynamic power reduction of 6.2%-9.4%.
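As an illustration of the two techniques named above, the following is a minimal software sketch, not the circuit proposed in this work. It counts bit toggles on a 16-bit bus (the width of a Bfloat16 word) for a raw stream, for a bus-invert coded stream, and for a stream whose register is held whenever the incoming word is zero. The function names, the all-zero initial bus state, and the one-bit invert and is-zero flag lines are assumptions made for the example. Toggle counts are used only as a rough proxy for dynamic switching power.

```python
WIDTH = 16                      # assumed bus width: one Bfloat16 word
MASK = (1 << WIDTH) - 1


def toggles(a, b):
    """Number of bit positions that differ between two bus states."""
    return bin((a ^ b) & MASK).count("1")


def raw_toggles(words):
    """Total toggles when words are driven onto the bus unencoded."""
    prev, total = 0, 0
    for w in words:
        w &= MASK
        total += toggles(prev, w)
        prev = w
    return total


def bus_invert_stream(words):
    """Classic bus-invert coding: send the complement (and raise an
    invert flag) whenever more than half of the bus lines would flip.
    Returns (encoded words, invert flags, total toggles incl. flag line)."""
    prev_bus, prev_inv, total = 0, 0, 0
    encoded, flags = [], []
    for w in words:
        w &= MASK
        if toggles(prev_bus, w) > WIDTH // 2:
            bus, inv = (~w) & MASK, 1
        else:
            bus, inv = w, 0
        total += toggles(prev_bus, bus) + (prev_inv ^ inv)
        encoded.append(bus)
        flags.append(inv)
        prev_bus, prev_inv = bus, inv
    return encoded, flags, total


def bus_invert_decode(encoded, flags):
    """Receiver side: undo the inversion wherever the flag is set."""
    return [(~b) & MASK if f else b for b, f in zip(encoded, flags)]


def zero_gated_toggles(words):
    """Zero-value gating (assumed model): when the incoming word is zero,
    the data register keeps its old value and only a 1-bit is-zero flag
    is forwarded, so the data lines do not switch."""
    held, prev_flag, total = 0, 0, 0
    for w in words:
        w &= MASK
        is_zero = int(w == 0)
        if not is_zero:
            total += toggles(held, w)
            held = w
        total += prev_flag ^ is_zero
        prev_flag = is_zero
    return total
```

In this model, a stream that alternates between nearly complementary words benefits most from bus-invert coding, while a sparse stream with many zero words benefits most from gating. This matches the idea of applying each technique only where the data statistics justify it.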