Hardware acceleration of deep-learning kernels relies on matrix multiplications that execute efficiently on Systolic Arrays (SAs). To trade off deep-learning training/inference quality against hardware cost, SA accelerators employ reduced-precision Floating-Point (FP) arithmetic. In this work, we demonstrate the need for new pipeline organizations that reduce the latency and improve the energy efficiency of reduced-precision FP operators for the chained multiply-add operation imposed by the structure of the SA. The proposed skewed pipeline design reorganizes the pipelined operation of the FP multiply-add units to enable new forwarding paths for the exponent logic, which allow the pipeline stages of consecutive processing elements (PEs) to execute in parallel. As a result, the latency of matrix multiplication within the SA is significantly reduced at minimal hardware cost, yielding energy reductions of 8% and 11% for the examined state-of-the-art CNNs.
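The chained multiply-add operation mentioned in the abstract can be illustrated with a minimal software sketch. It is not the proposed skewed pipeline or any hardware model: it only shows the data dependency in which each PE adds its product to the partial sum received from the previous PE. The function names (`to_fp16`, `pe_multiply_add`, `column_dot`) are hypothetical, and IEEE 754 half precision is an assumed stand-in for the reduced-precision format, which the abstract does not specify.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value.

    Half precision is an assumption made for illustration only.
    """
    return struct.unpack("e", struct.pack("e", x))[0]


def pe_multiply_add(a: float, w: float, acc_in: float) -> float:
    """One hypothetical PE: multiply, add the incoming partial sum, round.

    The product and sum are formed in double precision and rounded once
    at the end, which approximates a fused multiply-add.
    """
    return to_fp16(a * w + acc_in)


def column_dot(activations, weights) -> float:
    """Chain PEs along one SA column: each PE consumes the previous PE's output."""
    acc = 0.0
    for a, w in zip(activations, weights):
        # This dependency on `acc` is what serializes consecutive PEs.
        acc = pe_multiply_add(to_fp16(a), to_fp16(w), acc)
    return acc


if __name__ == "__main__":
    print(column_dot([1.0, 2.0, 3.0], [0.5, 0.25, 2.0]))  # prints 7.0
```

Each iteration of the loop cannot start its addition until the previous partial sum is available, which is the serialization that a pipeline reorganization across consecutive PEs would aim to shorten.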