We revisit the relationship between attention mechanisms and large-kernel ConvNets in vision transformers and propose a new spatial attention mechanism, Large Kernel Convolutional Attention (LKCA). LKCA simplifies the attention operation by replacing it with a single large-kernel convolution. It combines the advantages of convolutional neural networks and vision transformers: a large receptive field, locality, and parameter sharing. We explain the advantages of LKCA from both the convolution and the attention perspectives and provide an equivalent code implementation for each view. Experiments confirm that the convolution-based and attention-based implementations of LKCA perform equivalently. We evaluate the LKCA variant of ViT extensively on classification and segmentation tasks, and the results show that LKCA achieves competitive performance on visual tasks. Our code will be made publicly available at https://github.com/CatworldLee/LKCA.
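The claimed equivalence between the convolution view and the attention view can be illustrated with a small sketch. It is a toy reading of the abstract, not the authors' implementation: it uses a single channel, zero padding, and a 3x3 kernel standing in for a large one, and the function names (`conv_view`, `attention_matrix`, `attention_view`) are hypothetical. The "attention" matrix here holds static kernel weights indexed by relative token position, with no softmax and no query-key products; whether the released code makes the same choices is an assumption.

```python
# Illustrative sketch only: single channel, zero padding, plain Python lists.
# Not the authors' LKCA code; all names below are hypothetical.

def conv_view(x, kernel):
    """Zero-padded 'same' 2D cross-correlation of an H x W grid with an odd K x K kernel."""
    H, W = len(x), len(x[0])
    r = len(kernel) // 2
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            acc = 0.0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < H and 0 <= jj < W:  # outside the grid counts as zero
                        acc += kernel[di + r][dj + r] * x[ii][jj]
            out[i][j] = acc
    return out


def attention_matrix(H, W, kernel):
    """N x N matrix (N = H*W) whose entry [q][k] is the kernel weight at the
    relative offset from query token q to key token k, or 0 outside the window."""
    r = len(kernel) // 2
    N = H * W
    A = [[0.0] * N for _ in range(N)]
    for q in range(N):
        qi, qj = divmod(q, W)
        for k in range(N):
            ki, kj = divmod(k, W)
            di, dj = ki - qi, kj - qj
            if abs(di) <= r and abs(dj) <= r:
                A[q][k] = kernel[di + r][dj + r]  # same weight for every q: parameter sharing
    return A


def attention_view(x, kernel):
    """Apply the same operator as conv_view, written as A @ v over flattened tokens."""
    H, W = len(x), len(x[0])
    N = H * W
    A = attention_matrix(H, W, kernel)
    v = [x[i][j] for i in range(H) for j in range(W)]  # flatten grid to N tokens
    y = [sum(A[q][k] * v[k] for k in range(N)) for q in range(N)]
    return [y[i * W:(i + 1) * W] for i in range(H)]
```

Under these assumptions, both functions compute the same output for any grid and odd-sized kernel. The matrix form makes the three properties named above visible: the nonzero band of each row is the receptive field, the zeros outside it express locality, and the reuse of one kernel across all rows is parameter sharing.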