With the ever-growing popularity of Artificial Intelligence, there is increasing demand for more performant and efficient underlying hardware. Convolutional Neural Networks (CNNs) are a particularly important workload, achieving high accuracy in computer vision applications. Inside CNNs, a significant fraction of post-activation values are zero, so many of the computations that consume them are redundant. Recent work has explored this post-activation sparsity on instruction-based CNN accelerators but not on streaming CNN accelerators, even though streaming architectures are considered the leading design methodology in terms of performance. In this paper, we highlight the challenges of exploiting post-activation sparsity for performance gains in streaming CNN accelerators and demonstrate our approach to addressing them. On a set of modern CNN benchmarks, our streaming sparse accelerators achieve 1.41x to 1.93x the efficiency (GOP/s/DSP) of state-of-the-art instruction-based sparse accelerators.
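To make the notion of post-activation sparsity concrete, the following is a minimal software sketch, not the accelerator design described in this paper. It assumes a ReLU activation and uses made-up toy values; the names `relu`, `sparse_dot`, `pre_activations`, and `weights` are hypothetical and chosen only for illustration. It shows how zeros produced by the activation allow multiply-accumulate (MAC) operations to be skipped without changing the result.

```python
def relu(x):
    """ReLU activation: negative inputs become exactly zero."""
    return x if x > 0.0 else 0.0


def sparse_dot(activations, weights):
    """Dot product that skips zero activations.

    Returns the accumulated sum and the number of MACs actually performed.
    """
    acc = 0.0
    macs = 0
    for a, w in zip(activations, weights):
        if a == 0.0:
            continue  # a zero activation contributes nothing, so skip the MAC
        acc += a * w
        macs += 1
    return acc, macs


# Toy values (assumed for illustration, not taken from any benchmark).
pre_activations = [-1.5, 0.5, -0.25, 2.0, -3.0, 0.0, 1.0, -0.5]
weights = [0.5, -1.0, 2.0, 0.25, 1.0, -2.0, 0.5, 4.0]

post_activations = [relu(v) for v in pre_activations]
sparsity = post_activations.count(0.0) / len(post_activations)

sparse_result, macs_done = sparse_dot(post_activations, weights)
dense_result = sum(a * w for a, w in zip(post_activations, weights))

print(f"post-activation sparsity: {sparsity:.3f}")
print(f"MACs performed: {macs_done} of {len(weights)}")
print(f"sparse result equals dense result: {sparse_result == dense_result}")
```

In this toy case five of the eight post-activation values are zero, so only three MACs are needed. Turning such skipped operations into actual throughput gains in hardware is a separate problem, which is the subject of this paper.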