The widespread proliferation of deep learning applications has created the need to accelerate them directly in hardware. General Matrix Multiplication (GEMM) kernels are fundamental deep-learning constructs, and they map naturally onto Systolic Arrays (SAs), regular structures that are well suited to accelerating matrix multiplication. A typical SA uses a pipelined array of Processing Elements (PEs) that communicate through local connections and pre-orchestrated data movements. In this work, we show that the physical layout of SAs should be asymmetric to minimize wirelength and improve energy efficiency. An asymmetric floorplan adapts better to the unequal widths of the horizontal and vertical data buses and to their switching activity profiles. We demonstrate that such physically asymmetric SAs reduce interconnect power by 9.1% when executing state-of-the-art Convolutional Neural Network (CNN) layers, compared to SAs of the same size with a square (i.e., symmetric) layout. These savings in interconnect power translate, in turn, to 2.1% overall power savings.
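The intuition behind an asymmetric floorplan can be illustrated with a simple first-order model. This is a minimal sketch under stated assumptions, not the authors' evaluation flow: each PE is a rectangle of fixed area, every bit of the horizontal bus runs the PE's width, every bit of the vertical bus runs its height, and interconnect power is taken as proportional to activity × bitwidth × wire length. The function names and the 8-bit/32-bit bus widths are hypothetical, chosen only for illustration. Under this model, minimizing $a_h B_h w + a_v B_v h$ subject to $w h = A$ gives $w/h = (a_v B_v)/(a_h B_h)$.

```python
import math


def optimal_aspect_ratio(bits_h, bits_v, act_h=1.0, act_v=1.0):
    """Width/height ratio that minimizes the modeled interconnect cost.

    Derived by minimizing act_h*bits_h*w + act_v*bits_v*h with w*h fixed.
    """
    return (act_v * bits_v) / (act_h * bits_h)


def pe_dimensions(area, aspect_ratio):
    """Return (width, height) of a rectangular PE with the given area
    and width/height ratio."""
    width = math.sqrt(area * aspect_ratio)
    return width, area / width


def interconnect_cost(width, height, bits_h, bits_v, act_h=1.0, act_v=1.0):
    """Relative switched wire length per PE (arbitrary units).

    Horizontal bus bits span the PE width; vertical bus bits span its height.
    """
    return act_h * bits_h * width + act_v * bits_v * height


def relative_saving(bits_h, bits_v, act_h=1.0, act_v=1.0, area=1.0):
    """Fractional cost reduction of the optimal rectangle versus a square PE."""
    square = interconnect_cost(*pe_dimensions(area, 1.0),
                               bits_h, bits_v, act_h, act_v)
    ratio = optimal_aspect_ratio(bits_h, bits_v, act_h, act_v)
    best = interconnect_cost(*pe_dimensions(area, ratio),
                             bits_h, bits_v, act_h, act_v)
    return 1.0 - best / square


if __name__ == "__main__":
    # Illustrative (assumed) bus widths: 8-bit horizontal, 32-bit vertical.
    print("aspect ratio w/h:", optimal_aspect_ratio(8, 32))
    print("modeled saving vs. square:", relative_saving(8, 32))
```

The model suggests stretching each PE along the direction of the narrower, less active bus so that the wider, busier bus travels a shorter distance. The savings it produces are for this toy model only and are not expected to match the 9.1% interconnect figure reported above, which comes from physical implementation and measured switching activity.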