Vision Transformers (ViTs) have recently become the state of the art across many computer vision tasks. In contrast to convolutional neural networks (CNNs), ViTs enable global information sharing even within the shallow layers of a network, i.e., among high-resolution features. However, this advantage was later overlooked following the success of pyramid architectures such as Swin Transformer, which offer better performance-complexity trade-offs. In this paper, we present a simple and efficient add-on component (termed GrafT) that considers global dependencies and multi-scale information throughout the network, in both high- and low-resolution features. It can branch out at arbitrary depths and shares most of the parameters and computations of the backbone. GrafT shows consistent gains over various well-known models, including both hybrid and pure Transformer types, both homogeneous and pyramid structures, and various self-attention methods. In particular, it largely benefits mobile-size models by providing high-level semantics. On the ImageNet-1k dataset, GrafT delivers +3.9%, +1.4%, and +1.9% top-1 accuracy improvements to DeiT-T, Swin-T, and MobileViT-XXS, respectively. Our code and models will be made available.