Convolution is the most time-consuming operation in deep neural networks, so its performance is critical to the overall performance of the network. The commonly used methods for convolution on GPU are general matrix multiplication (GEMM)-based convolution and direct convolution. GEMM-based convolution relies on the im2col algorithm, which results in a large memory footprint and reduced performance. Direct convolution avoids the large memory footprint, but its performance is not on par with the GEMM-based approach because of discontinuous memory access. This paper proposes im2win, a window-order-based convolution paradigm on GPU that both reduces the memory footprint and offers continuous memory access, resulting in improved performance. Furthermore, we apply a range of optimization techniques to the convolution CUDA kernel, including shared memory, tiling, micro-kernel, double buffering, and prefetching. We compare our implementation with direct convolution, PyTorch's GEMM-based convolution using cuBLAS, and six cuDNN-based convolution implementations on twelve state-of-the-art DNN benchmarks. The experimental results show that our implementation 1) has a 23.1% smaller memory footprint and achieves 3.5$\times$ the TFLOPS of cuBLAS, 2) has a 32.8% smaller memory footprint and achieves up to 1.8$\times$ the TFLOPS of the best-performing convolutions in cuDNN, and 3) achieves up to 155$\times$ the TFLOPS of direct convolution. We further perform an ablation study on the applied optimization techniques and find that the micro-kernel has the greatest positive impact on performance.
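The memory trade-off between the two baseline approaches can be illustrated with a minimal CPU sketch in pure Python. It assumes a single channel, stride 1, and no padding, and it does not implement im2win or any of the CUDA optimizations; the function names (`direct_conv2d`, `im2col`, `gemm_conv2d`) and the toy input are hypothetical and chosen only for illustration.

```python
def direct_conv2d(image, kernel):
    """Direct convolution (cross-correlation, as used in DNNs); stride 1, no padding."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = h - kh + 1, w - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            acc = 0
            # Each window reads kh separate row segments of the input,
            # which is the source of the non-contiguous access pattern.
            for u in range(kh):
                for v in range(kw):
                    acc += image[i + u][j + v] * kernel[u][v]
            out[i][j] = acc
    return out


def im2col(image, kh, kw):
    """Copy every kh x kw window into its own flat row (one row per output pixel)."""
    h, w = len(image), len(image[0])
    oh, ow = h - kh + 1, w - kw + 1
    return [
        [image[i + u][j + v] for u in range(kh) for v in range(kw)]
        for i in range(oh)
        for j in range(ow)
    ]


def gemm_conv2d(image, kernel):
    """GEMM-style convolution: im2col followed by a matrix-vector product."""
    kh, kw = len(kernel), len(kernel[0])
    ow = len(image[0]) - kw + 1
    cols = im2col(image, kh, kw)
    flat_kernel = [x for row in kernel for x in row]
    flat_out = [sum(a * b for a, b in zip(row, flat_kernel)) for row in cols]
    # Reshape the flat result back into output rows of width ow.
    return [flat_out[r:r + ow] for r in range(0, len(flat_out), ow)]


# Toy 6x6 input and 3x3 filter (illustrative values only).
image = [[r * 6 + c for c in range(6)] for r in range(6)]
kernel = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]

direct = direct_conv2d(image, kernel)
via_gemm = gemm_conv2d(image, kernel)

input_elems = len(image) * len(image[0])
cols = im2col(image, 3, 3)
im2col_elems = len(cols) * len(cols[0])

print("results match:", direct == via_gemm)
print("input elements:", input_elems, "im2col elements:", im2col_elems)
```

Both paths produce the same output, but in this toy case the im2col buffer holds 144 elements against 36 in the input, because overlapping windows are duplicated. The direct path needs no extra buffer but gathers each window from several separate input rows.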