Transformers have achieved tremendous success in various computer vision tasks. By borrowing design concepts from transformers, many studies have revolutionized CNNs and achieved remarkable results. This paper follows that line of work. Specifically, we introduce a convolutional neural network architecture named ParCNetV2, which extends position-aware circular convolution (ParCNet) with oversized convolutions and strengthens attention through bifurcate gate units. The oversized convolution uses a kernel $2\times$ the size of the input, which gives it a global receptive field for modeling long-range dependencies. At the same time, it provides implicit positional encoding by removing the shift-invariant property of convolutional kernels: when the kernel is twice as large as the input, the effective kernel differs at each spatial location. The bifurcate gate unit implements an attention mechanism similar to self-attention in transformers. It splits the input into two branches: one performs feature transformation and the other provides attention weights. Attention is applied through element-wise multiplication of the two branches. In addition, we introduce a unified local-global convolution block that unifies the design of the early-stage and late-stage convolutional blocks. Extensive experiments demonstrate that our method outperforms other pure convolutional neural networks as well as hybrid networks that combine CNNs and transformers.
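The two mechanisms described above can be illustrated with a minimal one-dimensional NumPy sketch. This is not the ParCNetV2 implementation. The function names, the kernel length of $2N-1$ for an input of length $N$, the zero padding, and the channel-wise split in the gate are all assumptions made for illustration. The sketch shows that, with an oversized kernel, each output position multiplies the real inputs by a different slice of the kernel, and that gating reduces to an element-wise product of two branches.

```python
import numpy as np


def oversized_conv1d(x, w):
    """Zero-padded cross-correlation of a length-N signal with a length-(2N-1) kernel.

    The kernel length 2N-1 is an assumption for this sketch; the text only
    states that the kernel is about twice the input size.
    """
    n = x.shape[0]
    assert w.shape[0] == 2 * n - 1, "kernel must have length 2N-1"
    padded = np.pad(x, (n - 1, n - 1))  # zeros on both sides
    # Output i is the dot product of the full kernel with a window of the padded input.
    return np.array([padded[i:i + 2 * n - 1] @ w for i in range(n)])


def effective_kernel(w, i, n):
    """Slice of w that multiplies the N real (non-padding) inputs at output position i."""
    return w[n - 1 - i:2 * n - 1 - i]


def bifurcate_gate(x, transform):
    """Hypothetical gate: split channels (axis 0) in half, transform one half,
    and use the other half as element-wise attention weights."""
    features, gate = np.split(x, 2, axis=0)
    return transform(features) * gate


# Small demonstration with made-up values.
n = 4
x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.arange(2 * n - 1, dtype=float)
y = oversized_conv1d(x, w)

# Every output position covers the whole input, but with a different kernel slice.
for i in range(n):
    print(i, effective_kernel(w, i, n), y[i])
```

Because the slice returned by `effective_kernel` shifts with the output index, the operation is no longer shift-invariant, which is the property the abstract relates to implicit positional encoding.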