As Convolutional Neural Networks (CNNs) become increasingly prevalent in deep learning applications, numerous algorithms, such as the Winograd algorithm, have been proposed to enhance their efficiency. However, existing implementations of Winograd Convolution based on General Matrix Multiplication (GEMM) exhibit certain limitations: the transformation stages take up a significant portion of the total runtime, computational efficiency is suboptimal, and a single parallel strategy leads to reduced parallel efficiency for certain layers. In this article, we present a novel fused Winograd Convolution algorithm that optimizes all three stages of Winograd Convolution (input and filter transformation, computation, and output transformation), carefully tailored for ARMv8 manycore CPUs. Our method keeps memory accesses consecutive as far as possible during the transformation stages and integrates data packing into a z-shape customized data layout, which is well suited to our meticulously optimized GEMM micro-kernel based on a ping-pong technique. Moreover, we introduce a three-mode parallel strategy that adaptively switches according to the scale of each convolutional layer, addressing the shortcomings of current methodologies. By hand-optimizing each kernel at the assembly level and thoroughly analyzing the blocking parameters, we significantly reduce transformation time and enhance computational efficiency compared with state-of-the-art libraries. Experimental results on the Kunpeng 920 demonstrate that our method achieves speedups of up to 2.35x and 2.39x over NCNN and NNPACK, respectively, for single-thread execution, and geometric-mean speedups of 1.66x and 2.06x for multi-thread execution.