Transformer models have made tremendous progress in various fields in recent years. In computer vision, vision transformers (ViTs) have also become strong alternatives to convolutional neural networks (ConvNets), yet they have not replaced ConvNets, since each has its own merits. For instance, ViTs are good at extracting global features with attention mechanisms, while ConvNets are more efficient at modeling local relationships because of their strong inductive bias. A natural idea is to combine the strengths of ConvNets and ViTs to design new structures. In this paper, we propose a new basic neural network operator named position-aware circular convolution (ParC) and its accelerated version, Fast-ParC. The ParC operator captures global features by using a global kernel and circular convolution, while keeping position sensitivity by employing position embeddings. Fast-ParC further reduces the O(n²) time complexity of ParC to O(n log n) using the Fast Fourier Transform. This acceleration makes it possible to use global convolution in the early stages of models, where feature maps are large, while keeping the overall computational cost comparable to that of 3×3 or 7×7 kernels. The proposed operation can be used in a plug-and-play manner to 1) convert ViTs to pure-ConvNet architectures, which enjoy wider hardware support and achieve higher inference speed; and 2) replace traditional convolutions in the deep stages of ConvNets to improve accuracy by enlarging the effective receptive field. Experimental results show that our ParC op effectively enlarges the receptive field of traditional ConvNets, and that adopting the proposed op benefits both ViT and ConvNet models on all three popular vision tasks: image classification, object detection, and semantic segmentation.
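The complexity claim above can be illustrated with a minimal 1D sketch. It is not the authors' implementation: the function names (`circular_conv_direct`, `circular_conv_fft`, `parc_like_1d`), the 1D setting, the convolution (rather than correlation) convention, and the choice to add the position embedding to the input before convolving are all assumptions made for illustration. It shows only that a global circular convolution computed with a double loop in O(n²) matches the O(n log n) result obtained through the convolution theorem.

```python
import numpy as np

def circular_conv_direct(x, w):
    """O(n^2) circular convolution: y[i] = sum_k w[k] * x[(i - k) mod n]."""
    n = len(x)
    y = np.zeros(n)
    for i in range(n):
        for k in range(n):
            y[i] += w[k] * x[(i - k) % n]
    return y

def circular_conv_fft(x, w):
    """O(n log n) circular convolution via the convolution theorem:
    multiply the spectra of x and w, then transform back."""
    return np.fft.irfft(np.fft.rfft(x) * np.fft.rfft(w), n=len(x))

def parc_like_1d(x, w, pos_embed, conv=circular_conv_fft):
    """Hypothetical ParC-style step (assumption): add a position embedding
    to the input, then apply a global kernel w of the same length as x."""
    return conv(x + pos_embed, w)

rng = np.random.default_rng(0)
n = 16
x = rng.standard_normal(n)    # input signal
w = rng.standard_normal(n)    # global kernel, same length as the input
pe = rng.standard_normal(n)   # position embedding (random here, learned in practice)

slow = parc_like_1d(x, w, pe, conv=circular_conv_direct)
fast = parc_like_1d(x, w, pe, conv=circular_conv_fft)
print(np.allclose(slow, fast))
```

Without the position embedding, a circular convolution is shift-equivariant, so rolling the input simply rolls the output. Adding a fixed embedding before the convolution breaks that symmetry, which is one plausible reading of how the operator stays position-sensitive.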