Fusing Depthwise and Pointwise Convolutions for Efficient Inference on GPUs

Depthwise and pointwise convolutions have fewer parameters and perform fewer operations than standard convolutions. As a result, they are increasingly used in compact DNNs, including convolutional neural networks (CNNs) and vision transformers (ViTs). However, they have a lower compute-to-memory-access ratio than standard convolutions, so memory accesses are often their performance bottleneck. This paper explores fusing depthwise and pointwise convolutions on GPUs to overcome this memory access bottleneck. Prior work on GPU-based fusion suffers from one or more of the following limitations: (1) fusing only a convolution with an element-wise operator, or only multiple non-convolutional operators; (2) not explicitly optimizing for memory accesses; (3) not supporting depthwise convolutions. This paper proposes Fused Convolutional Modules (FCMs), a set of novel fused depthwise and pointwise GPU kernels. FCMs significantly reduce the memory accesses of pointwise and depthwise convolutions, improving execution time and energy efficiency. To evaluate the trade-offs associated with fusion, and to determine which convolutions are beneficial to fuse and the optimal FCM parameters, we propose FusePlanner. FusePlanner consists of cost models that estimate the memory accesses of depthwise, pointwise, and FCM kernels given GPU characteristics. Our experiments on three GPUs using representative CNNs and ViTs demonstrate that FCMs save up to 83% of the memory accesses and achieve speedups of up to 3.7x compared to cuDNN. Complete model implementations of various CNNs using our modules outperform TVM's, achieving speedups of up to 1.8x and saving up to two-thirds of the energy.
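To make the memory-access argument concrete, the following is a minimal back-of-the-envelope sketch in Python. It is not FusePlanner's cost model. It assumes an idealized GPU in which every tensor element crosses global memory exactly once per kernel, with no cache effects, no tiling overhead, and "same" padding. The names `Layer`, `unfused_accesses`, `fused_accesses`, and `macs` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    """Shape of one depthwise (DW) + pointwise (PW) pair (illustrative names)."""
    height: int        # input feature-map height
    width: int         # input feature-map width
    channels: int      # DW input/output channels, also PW input channels
    out_channels: int  # PW output channels
    kernel: int = 3    # DW filter is kernel x kernel
    stride: int = 1    # DW stride


def out_hw(layer):
    """DW output height and width, assuming 'same' padding (ceil division)."""
    return -(-layer.height // layer.stride), -(-layer.width // layer.stride)


def unfused_accesses(layer):
    """Elements moved to/from global memory when DW and PW are separate kernels."""
    oh, ow = out_hw(layer)
    ifm = layer.height * layer.width * layer.channels
    mid = oh * ow * layer.channels             # DW output, which is the PW input
    ofm = oh * ow * layer.out_channels
    dw_weights = layer.kernel ** 2 * layer.channels
    pw_weights = layer.channels * layer.out_channels
    dw = ifm + dw_weights + mid                # read input and weights, write mid
    pw = mid + pw_weights + ofm                # read mid and weights, write output
    return dw + pw


def fused_accesses(layer):
    """Elements moved if the intermediate feature map never leaves the chip."""
    oh, ow = out_hw(layer)
    mid = oh * ow * layer.channels
    return unfused_accesses(layer) - 2 * mid   # skip one write and one read of mid


def macs(layer):
    """Multiply-accumulate operations of the DW + PW pair."""
    oh, ow = out_hw(layer)
    dw = oh * ow * layer.channels * layer.kernel ** 2
    pw = oh * ow * layer.channels * layer.out_channels
    return dw + pw


if __name__ == "__main__":
    layer = Layer(height=56, width=56, channels=128, out_channels=128)
    unfused, fused = unfused_accesses(layer), fused_accesses(layer)
    print(f"unfused accesses: {unfused}")
    print(f"fused accesses:   {fused}")
    print(f"saved fraction:   {1 - fused / unfused:.3f}")
    print(f"MACs per access:  {macs(layer) / unfused:.1f} -> {macs(layer) / fused:.1f}")
```

Under these assumptions, fusion removes exactly one write and one read of the intermediate feature map while the operation count stays the same, so the compute-to-memory-access ratio rises. A real kernel also has to account for tiling, redundant loads at tile borders, and on-chip capacity, which this sketch deliberately ignores.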
