Vision transformers have shown great success on numerous computer vision tasks. However, their central component, softmax attention, prevents vision transformers from scaling up to high-resolution images, because both its computational complexity and its memory footprint are quadratic in the number of tokens. Although linear attention was introduced in natural language processing (NLP) to mitigate a similar issue, directly applying existing linear attention to vision transformers may not lead to satisfactory results. We investigate this problem and find that computer vision tasks focus more on local information than NLP tasks do. Based on this observation, we present Vicinity Attention, which introduces a locality bias to vision transformers with linear complexity. Specifically, for each image patch, we adjust its attention weights based on the 2D Manhattan distance to the other patches, so that neighbouring patches receive stronger attention than far-away patches. Moreover, since Vicinity Attention requires the token length to be much larger than the feature dimension to show its efficiency advantages, we further propose a new Vicinity Vision Transformer (VVT) structure that reduces the feature dimension without degrading accuracy. We perform extensive experiments on the CIFAR100, ImageNet1K, and ADE20K datasets to validate the effectiveness of our method. As the input resolution increases, the GFLOPs of our method grow more slowly than those of previous transformer-based and convolution-based networks. In particular, our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
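To make the locality bias concrete, the following minimal sketch computes the 2D Manhattan distance between every pair of patches on a small grid and turns it into a weight that decays with distance. The function names, the row-major patch indexing, and the decay 1 / (1 + d) are assumptions chosen for illustration only; the abstract does not specify the weighting function. The sketch also builds the full pairwise table, which is quadratic in the number of patches, so it illustrates the bias itself rather than the linear-complexity formulation.

```python
def manhattan_distances(height, width):
    """Pairwise 2D Manhattan distances between patches on a height x width grid.

    Patches are indexed in row-major order (an assumption for this sketch),
    so patch k sits at (row, col) = divmod(k, width).
    """
    coords = [divmod(k, width) for k in range(height * width)]
    return [
        [abs(r1 - r2) + abs(c1 - c2) for (r2, c2) in coords]
        for (r1, c1) in coords
    ]


def locality_weights(distances):
    """Hypothetical decay 1 / (1 + d), normalised so each row sums to 1.

    This is an illustrative choice, not the weighting used in the paper.
    """
    weights = []
    for row in distances:
        raw = [1.0 / (1.0 + d) for d in row]
        total = sum(raw)
        weights.append([w / total for w in raw])
    return weights


# A 3 x 3 grid of patches: patch 0 is the top-left corner, patch 8 the bottom-right.
dist = manhattan_distances(3, 3)
weights = locality_weights(dist)
```

In this toy grid, the top-left patch weights its immediate right-hand neighbour more heavily than the opposite corner, which is the qualitative behaviour the locality bias is meant to produce.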