We propose the global context vision transformer (GC ViT), a novel architecture that improves parameter and compute utilization for computer vision. Our method combines global context self-attention modules with standard local self-attention to model both long- and short-range spatial interactions effectively and efficiently, without expensive operations such as computing attention masks or shifting local windows. In addition, we address the lack of inductive bias in ViTs by using modified fused inverted residual blocks in our architecture. GC ViT achieves state-of-the-art results across image classification, object detection, and semantic segmentation tasks. On the ImageNet-1K classification dataset, the GC ViT variants with 51M, 90M, and 201M parameters achieve 84.3%, 85.0%, and 85.7% Top-1 accuracy, respectively, at 224×224 image resolution and without any pre-training, surpassing comparably sized prior art such as the CNN-based ConvNeXt and the ViT-based MaxViT and Swin Transformer by a large margin. Pre-trained GC ViT backbones consistently outperform prior work on the downstream tasks of object detection, instance segmentation, and semantic segmentation on the MS COCO and ADE20K datasets. Specifically, GC ViT with a 4-scale DINO detection head achieves a box AP of 58.3 on the MS COCO dataset.
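The abstract does not spell out how the global and local attention modules differ, so the following NumPy sketch is only an illustration under stated assumptions, not the authors' implementation. It assumes that local attention draws queries, keys, and values from the same non-overlapping window, while global context attention shares one set of query tokens, derived from the whole feature map, across all windows and keeps keys and values local. The average-pooling query generator and all function names are hypothetical; learned projections, multiple heads, and position biases are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def window_partition(x, w):
    """Split an (H, W, C) map into (num_windows, w*w, C) non-overlapping windows."""
    H, W, C = x.shape
    x = x.reshape(H // w, w, W // w, w, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, w * w, C)


def attention(q, k, v):
    """Scaled dot-product attention; q, k, v have shape (num_windows, tokens, C)."""
    scale = q.shape[-1] ** -0.5
    weights = softmax(q @ k.transpose(0, 2, 1) * scale)
    return weights @ v


def local_window_attention(x, w):
    """Short-range modeling: queries, keys, and values come from the same window."""
    win = window_partition(x, w)
    return attention(win, win, win)


def global_context_attention(x, w):
    """Long-range modeling (assumed form): every window uses the same w*w query
    tokens computed from the whole map; keys and values stay local."""
    H, W, C = x.shape
    # Hypothetical query generator: average-pool the full map down to a w x w grid.
    g = x.reshape(w, H // w, w, W // w, C).mean(axis=(1, 3)).reshape(1, w * w, C)
    win = window_partition(x, w)
    q = np.broadcast_to(g, win.shape)  # same global queries for every window
    return attention(q, win, win)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feat = rng.standard_normal((8, 8, 4))  # toy 8x8 feature map with 4 channels
    print(local_window_attention(feat, 4).shape)
    print(global_context_attention(feat, 4).shape)
```

In this sketch neither function needs an attention mask or a window shift: the windows are fixed, and cross-window information enters only through the shared global queries.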