Convolutional Neural Networks (CNNs) are the state-of-the-art solution for many deep learning applications. For maximum scalability, their computation should combine high performance and energy efficiency. In practice, the convolutions of each CNN layer are mapped to a matrix multiplication that includes all input features and kernels of that layer and is computed using a systolic array. In this work, we focus on the design of a systolic array with a configurable pipeline, with the goal of selecting an optimal pipeline configuration for each CNN layer. The proposed systolic array, called ArrayFlex, can operate in either normal or shallow pipeline mode, thus balancing the execution time in cycles against the operating clock frequency. By selecting the appropriate pipeline configuration per CNN layer, ArrayFlex reduces the inference latency of state-of-the-art CNNs by 11%, on average, compared to a traditional fixed-pipeline systolic array. Most importantly, this result is achieved while using 13%-23% less power for the same applications, thus improving the combined energy-delay-product efficiency by between 1.4x and 1.8x.
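The per-layer trade-off between cycle count and clock frequency can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from ArrayFlex itself: the `Mode` fields, the cycle model (streamed rows plus a pipeline fill term that shrinks when stages are merged), and the clock frequencies are all hypothetical. The sketch only shows the selection principle: latency is cycles divided by frequency, and the mode with the lower latency is chosen for each layer.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Mode:
    """A pipeline configuration (all fields are hypothetical)."""
    name: str
    collapse: int      # assumed number of adjacent pipeline stages merged into one
    clock_ghz: float   # assumed achievable clock frequency in this mode


def cycles(stream_len: int, rows: int, cols: int, mode: Mode) -> int:
    """Toy cycle model: rows streamed through the array plus pipeline fill.

    Assumption: merging `collapse` stages divides the fill latency by `collapse`.
    """
    fill = math.ceil((rows + cols) / mode.collapse)
    return stream_len + fill


def latency_ns(stream_len: int, rows: int, cols: int, mode: Mode) -> float:
    """Latency in nanoseconds: cycles divided by frequency in GHz."""
    return cycles(stream_len, rows, cols, mode) / mode.clock_ghz


def pick_mode(stream_len: int, rows: int, cols: int, modes: list[Mode]) -> Mode:
    """Select the mode with the lowest estimated latency for one layer."""
    return min(modes, key=lambda m: latency_ns(stream_len, rows, cols, m))


# Hypothetical configurations: the shallow mode needs fewer cycles to fill
# the pipeline but runs at a lower clock frequency.
NORMAL = Mode("normal", collapse=1, clock_ghz=2.0)
SHALLOW = Mode("shallow", collapse=2, clock_ghz=1.6)
MODES = [NORMAL, SHALLOW]

if __name__ == "__main__":
    # Two made-up layers mapped onto an assumed 128x128 array.
    for stream_len in (64, 4096):
        best = pick_mode(stream_len, 128, 128, MODES)
        print(stream_len, best.name, latency_ns(stream_len, 128, 128, best))
```

Under this toy model, a layer that streams few rows is dominated by pipeline fill and favors the shallow mode, whereas a layer that streams many rows is dominated by throughput and favors the normal mode with its higher clock frequency.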