SP-ViT: Learning 2D Spatial Priors for Vision Transformers

Recently, transformers have shown great potential in image classification and established state-of-the-art results on the ImageNet benchmark. However, compared to CNNs, transformers converge slowly and are prone to overfitting in low-data regimes because they lack spatial inductive biases. Such biases can be especially beneficial because the 2D structure of an input image is not well preserved in transformers. In this work, we present Spatial Prior-enhanced Self-Attention (SP-SA), a novel variant of vanilla Self-Attention (SA) tailored for vision transformers. Spatial Priors (SPs) are our proposed family of inductive biases that highlight certain groups of spatial relations. Unlike convolutional inductive biases, which are forced to focus exclusively on hard-coded local regions, our proposed SPs are learned by the model itself and take a variety of spatial relations into account. Specifically, each head calculates its attention score with emphasis on certain kinds of spatial relations, and the spatial foci learned by different heads can be complementary to each other. Based on SP-SA, we propose the SP-ViT family, which consistently outperforms other ViT models with similar GFLOPs or parameter counts. Our largest model, SP-ViT-L, achieves a record-breaking 86.3% Top-1 accuracy among all ImageNet-1K models trained at 224x224 and fine-tuned at 384x384 resolution without extra data, while using almost 50% fewer parameters than the previous state-of-the-art model (150M for SP-ViT-L vs. 271M for CaiT-M-36).
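
To make the idea of a per-head, learned spatial emphasis concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes the prior is an additive bias on the attention logits, produced by a small per-head MLP applied to the relative 2D offsets between patches. The abstract specifies neither the form of the prior nor how it is combined with the attention score, so the function names (`relative_coords`, `sp_attention`) and the parameters `w1`, `b1`, `w2`, `b2` are hypothetical.

```python
import numpy as np


def relative_coords(grid_h, grid_w):
    """Return an (N, N, 2) array of (dy, dx) offsets between all patch pairs."""
    ys, xs = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    pos = np.stack([ys.ravel(), xs.ravel()], axis=-1).astype(np.float64)  # (N, 2)
    return pos[:, None, :] - pos[None, :, :]


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sp_attention(q, k, v, rel, w1, b1, w2, b2):
    """Self-attention with a learned per-head spatial bias (illustrative only).

    q, k, v: (H, N, d) per-head queries, keys, values.
    rel:     (N, N, 2) relative 2D offsets from relative_coords().
    w1, b1:  (H, 2, m), (H, m)  first layer of the hypothetical per-head MLP.
    w2, b2:  (H, m), (H,)       second layer, producing one scalar per patch pair.
    """
    d = q.shape[-1]
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(d)                 # (H, N, N)
    hidden = np.einsum("ijc,hcm->hijm", rel, w1) + b1[:, None, None, :]
    hidden = np.maximum(hidden, 0.0)                               # ReLU
    prior = np.einsum("hijm,hm->hij", hidden, w2) + b2[:, None, None]
    # Assumption: the prior is added to the logits before the softmax.
    attn = softmax(logits + prior, axis=-1)
    return attn @ v, attn
```

Because each head has its own MLP weights in this sketch, different heads are free to emphasize different offsets (for example, nearby patches in one head and patches in the same row in another). With `w2` and `b2` set to zero, the prior vanishes and the function reduces to vanilla self-attention.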
