We propose the global context vision transformer (GC ViT), a novel architecture that improves parameter and compute utilization for computer vision tasks. At the core of the model are global context self-attention modules, used jointly with standard local self-attention, which model both long- and short-range spatial interactions effectively and efficiently, as an alternative to complex operations such as attention masks or shifting local windows. While the local self-attention modules are responsible for modeling short-range information, the global query tokens are shared across all global self-attention modules and interact with local keys and values. In addition, we address the lack of inductive bias in ViTs and improve the modeling of inter-channel dependencies by proposing a novel downsampler that leverages a parameter-efficient fused inverted residual block. The proposed GC ViT achieves new state-of-the-art performance across image classification, object detection, and semantic segmentation tasks. On the ImageNet-1K dataset for classification, GC ViT models with 51M, 90M, and 201M parameters achieve 84.3%, 84.9%, and 85.6% Top-1 accuracy, respectively, surpassing comparably sized prior art such as the CNN-based ConvNeXt and the ViT-based Swin Transformer. Pre-trained GC ViT backbones consistently outperform prior work on the downstream tasks of object detection, instance segmentation, and semantic segmentation on the MS COCO and ADE20K datasets, sometimes by large margins.
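To make the interaction between shared global queries and local keys and values concrete, the following is a minimal single-head NumPy sketch, not the authors' implementation. It rests on several assumptions that the abstract does not state: the global query tokens are produced here by average-pooling the feature map down to the window size and applying a linear projection, and all function names (`window_partition`, `pool_to_window`, `local_attention`, `global_attention`) and weight matrices are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def window_partition(x, ws):
    # Split an (H, W, C) feature map into non-overlapping windows:
    # result has shape (num_windows, ws * ws, C).
    H, W, C = x.shape
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws * ws, C)


def pool_to_window(x, ws):
    # ASSUMPTION: stand-in for a global query generator. Average-pools the
    # whole (H, W, C) map to (ws, ws, C) and flattens it to (ws * ws, C).
    H, W, C = x.shape
    pooled = x.reshape(ws, H // ws, ws, W // ws, C).mean(axis=(1, 3))
    return pooled.reshape(ws * ws, C)


def local_attention(windows, Wq, Wk, Wv):
    # Standard self-attention restricted to each window:
    # queries, keys, and values all come from the same window.
    q, k, v = windows @ Wq, windows @ Wk, windows @ Wv
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(k.shape[-1])
    return softmax(scores) @ v


def global_attention(windows, q_global, Wk, Wv):
    # Global context attention: one set of query tokens, shared by every
    # window, attends to that window's local keys and values.
    k, v = windows @ Wk, windows @ Wv
    scores = q_global[None] @ k.transpose(0, 2, 1) / np.sqrt(k.shape[-1])
    return softmax(scores) @ v


# Toy sizes chosen only for illustration.
rng = np.random.default_rng(0)
H = W = 8
ws, C = 4, 16
x = rng.standard_normal((H, W, C))
Wq, Wk, Wv, Wq_g, Wk_g, Wv_g = (
    rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(6)
)

windows = window_partition(x, ws)               # (4, 16, 16)
q_global = pool_to_window(x, ws) @ Wq_g         # (16, 16), shared by all windows
y_local = local_attention(windows, Wq, Wk, Wv)  # short-range mixing
y_global = global_attention(y_local, q_global, Wk_g, Wv_g)  # long-range mixing
print(y_local.shape, y_global.shape)            # (4, 16, 16) (4, 16, 16)
```

Because `q_global` is computed once from the whole feature map and broadcast over the window axis, each window receives information summarizing the full image without attention masks or window shifting. A practical implementation would add multiple heads, residual connections, normalization, and positional bias, all of which are omitted here.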