Deep neural networks rely on parallel processors for acceleration. Designing operators for them requires not only good algorithms that reduce complexity, but also efficient utilization of the hardware. Convolutional layers mainly involve three kinds of operators: convolution in forward propagation, and deconvolution and dilated convolution in backward propagation. When these operators are executed, zeros are added to the tensors, causing redundant calculations. This paper presents the C-K-S algorithm (ConvV2, KS-deconv, Sk-dilated), which skips these zeros in two ways: it trims the filters to exclude padded zeros, and it transforms sparse tensors into dense tensors to avoid the zeros inserted in deconvolution and dilated convolution. Compared with regular convolution, deconvolution is harder to accelerate because of its complexity. This paper provides high-performance GPU implementations of C-K-S and verifies their effectiveness through comparison with PyTorch. According to the experiments, C-K-S has advantages over PyTorch in certain cases, especially for deconvolution on small feature maps. C-K-S can be further improved by optimizing it fully for specific GPU architectures.
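The two zero-skipping ideas named above can be illustrated with a minimal 1D CPU sketch. This is not the paper's GPU implementation, and it does not claim to reproduce how ConvV2 or KS-deconv organize their computation; the function names are hypothetical and chosen only for illustration. Each pair consists of a reference version that materializes the zeros and a version that skips them, and the two are expected to agree.

```python
def conv1d_padded(x, w, pad):
    """Reference: materialize the padded zeros, then slide the full filter."""
    xp = [0.0] * pad + list(x) + [0.0] * pad
    out_len = len(xp) - len(w) + 1
    return [sum(w[k] * xp[i + k] for k in range(len(w))) for i in range(out_len)]


def conv1d_trimmed(x, w, pad):
    """Same result, but each output visits only the filter taps that
    overlap real input elements (the filter is 'trimmed' at the borders)."""
    n, fw = len(x), len(w)
    out_len = n + 2 * pad - fw + 1
    out = []
    for i in range(out_len):
        start = i - pad               # input index touched by tap 0
        k_lo = max(0, -start)         # first tap that lands inside x
        k_hi = min(fw, n - start)     # one past the last tap inside x
        out.append(sum(w[k] * x[start + k] for k in range(k_lo, k_hi)))
    return out


def deconv1d_zero_insert(y, w, stride):
    """Reference: insert (stride - 1) zeros between elements of y,
    then apply a full convolution with w."""
    z = [0.0] * ((len(y) - 1) * stride + 1)
    for i, v in enumerate(y):
        z[i * stride] = v
    out_len = len(z) + len(w) - 1
    return [
        sum(w[k] * z[j - k] for k in range(len(w)) if 0 <= j - k < len(z))
        for j in range(out_len)
    ]


def deconv1d_dense(y, w, stride):
    """Same result without the sparse tensor: output j only uses taps k
    with k % stride == j % stride, which read the dense input y directly."""
    m, fw = len(y), len(w)
    out_len = (m - 1) * stride + fw
    out = []
    for j in range(out_len):
        acc = 0.0
        for k in range(j % stride, fw, stride):
            i = (j - k) // stride     # exact division, since k and j share a phase
            if 0 <= i < m:
                acc += w[k] * y[i]
        out.append(acc)
    return out


if __name__ == "__main__":
    x, w = [1.0, 2.0, 3.0, 4.0], [1.0, 0.0, -1.0]
    print(conv1d_padded(x, w, 1) == conv1d_trimmed(x, w, 1))
    y = [1.0, 2.0, 3.0]
    print(deconv1d_zero_insert(y, [1.0, 2.0, 3.0], 2) == deconv1d_dense(y, [1.0, 2.0, 3.0], 2))
```

In the dense variant, the taps used for a given output phase form a smaller sub-filter (`w[r::stride]`), so no multiplication touches an inserted zero. Whether a real GPU kernel benefits depends on memory layout and architecture, which this sketch does not model.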