As Convolutional Neural Networks (CNNs) gain prominence in deep learning, algorithms such as Winograd convolution have been introduced to improve computational efficiency. However, existing implementations often suffer from high transformation overhead, suboptimal computational efficiency, and poor parallel scalability in some layers. We propose a fused Winograd convolution algorithm optimized for ARMv8 CPUs that integrates input transformation, filter transformation, computation, and output transformation into a single pipeline. By maintaining consecutive memory access and using a custom z-shaped data layout, our approach fully exploits an optimized GEMM micro-kernel with a ping-pong technique. In addition, we introduce a multi-dimensional parallel strategy that adapts to the scale of each convolutional layer. To maximize performance, we hand-optimize each kernel in AArch64 assembly and carefully tune the blocking parameters. Experimental results with multiple threads show speedups of up to 4.74x, 4.10x, 4.72x, and 10.57x over NCNN, NNPACK, FastConv, and ACL on the Kunpeng 920 platform, with respective gains of 3.85x, 2.81x, 4.20x, and 7.80x on the AWS Graviton2, and 3.32x, 3.68x, 8.00x, and 9.28x on the Phytium 2000+.
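To make the underlying idea concrete: Winograd convolution trades multiplications for additions via a minimal filtering transform. The sketch below shows the classic 1-D F(2,3) case (two outputs, 3-tap filter), which computes a length-2 output with 4 multiplications instead of the 6 a direct convolution needs. This is the textbook formulation (Lavin and Gray), written in plain C for illustration; it is not the paper's fused, assembly-optimized implementation, and the function name is hypothetical.

```c
#include <math.h>

/* Winograd F(2,3): compute y[0..1] = 1-D convolution (valid mode) of
 * input d[0..3] with filter g[0..2] using only 4 multiplications.
 * In the 2-D tiled variant, these transforms become the input/filter/
 * output transformation stages that the paper fuses into one pipeline. */
static void winograd_f2_3(float y[2], const float d[4], const float g[3])
{
    /* Filter transform: 4 transformed filter values. */
    float u0 = g[0];
    float u1 = 0.5f * (g[0] + g[1] + g[2]);
    float u2 = 0.5f * (g[0] - g[1] + g[2]);
    float u3 = g[2];

    /* Input transform: 4 transformed input values. */
    float v0 = d[0] - d[2];
    float v1 = d[1] + d[2];
    float v2 = d[2] - d[1];
    float v3 = d[1] - d[3];

    /* Element-wise products: the 4 multiplications
     * (the stage the optimized GEMM micro-kernel performs in bulk). */
    float m0 = u0 * v0;
    float m1 = u1 * v1;
    float m2 = u2 * v2;
    float m3 = u3 * v3;

    /* Output transform. */
    y[0] = m0 + m1 + m2;
    y[1] = m1 - m2 - m3;
}
```

For example, with `d = {1,2,3,4}` and `g = {1,1,1}`, the direct convolution gives `y = {6, 9}`, and the transform above reproduces it exactly.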