Hardware acceleration of deep-learning kernels relies on matrix multiplications that execute efficiently on Systolic Arrays (SAs). To trade off deep-learning training/inference quality against hardware cost, SA accelerators employ reduced-precision Floating-Point (FP) arithmetic. In this work, we demonstrate the need for new pipeline organizations that reduce the latency and improve the energy efficiency of reduced-precision FP operators for the chained multiply-add operation imposed by the structure of the SA. The proposed skewed pipeline design reorganizes the pipelined operation of the FP multiply-add units to enable new forwarding paths for the exponent logic, which allow the pipeline stages of consecutive Processing Elements (PEs) to execute in parallel. As a result, the latency of matrix multiplication within the SA is significantly reduced at minimal hardware cost, yielding energy reductions of 8% and 11% for the examined state-of-the-art CNNs.
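To make the chained multiply-add dependency concrete, the following minimal Python sketch models one column of PEs, each of which multiplies its input by a weight and adds the partial sum arriving from the previous PE. It illustrates only the data dependency that the abstract refers to, not the pipeline stages or the skewed design itself. The names `to_fp16`, `pe_multiply_add`, and `column_dot` are hypothetical, and the choice of IEEE 754 half precision with rounding at every PE output is an assumption made for illustration.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def pe_multiply_add(a: float, w: float, acc_in: float) -> float:
    """One PE: multiply the input by the weight and add the incoming partial sum.

    Assumption: the result is rounded to half precision at the PE output.
    """
    return to_fp16(to_fp16(a) * to_fp16(w) + acc_in)


def column_dot(inputs, weights) -> float:
    """Chain the PEs of one column: each PE consumes the previous PE's output."""
    acc = 0.0
    for a, w in zip(inputs, weights):
        # PE i cannot finish before the partial sum from PE i-1 is available.
        acc = pe_multiply_add(a, w, acc)
    return acc


result = column_dot([1.0, 2.0, 3.0], [0.5, 0.25, 2.0])
```

Because every call to `pe_multiply_add` needs the `acc` value produced by the previous call, the latency of a straightforwardly pipelined column grows with both the number of PEs and the depth of each FP multiply-add pipeline. This serial chain is the dependency that the proposed forwarding paths are meant to shorten.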