Vision Transformers have demonstrated impressive success across various vision tasks. However, their heavy computation cost, which grows quadratically with the token sequence length, largely limits their ability to handle large feature maps. To alleviate the computation cost, previous works rely on either fine-grained self-attention restricted to small local regions, or global self-attention over a shortened sequence, which results in coarse granularity. In this paper, we propose a novel model, termed Self-guided Transformer~(SG-Former), towards effective global self-attention with adaptive fine granularity. At the heart of our approach is a significance map, which is estimated through hybrid-scale self-attention and evolves during training, and which we use to reallocate tokens based on the significance of each region. Intuitively, we assign more tokens to the salient regions to achieve fine-grained attention, while allocating fewer tokens to the minor regions in exchange for efficiency and global receptive fields. The proposed SG-Former achieves performance superior to the state of the art: our base-size model achieves \textbf{84.7\%} Top-1 accuracy on ImageNet-1K, \textbf{51.2 mAP} (box AP) on COCO, and \textbf{52.7 mIoU} on ADE20K, surpassing the Swin Transformer by \textbf{+1.3\% / +2.7 mAP / +3 mIoU}, with lower computation costs and fewer parameters. The code is available at \href{https://github.com/OliverRensu/SG-Former}{https://github.com/OliverRensu/SG-Former}.
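The token reallocation idea described in the abstract can be illustrated with a small sketch. This is not the SG-Former implementation: in the paper the significance map is estimated by hybrid-scale self-attention, whereas here it is simply a list of non-negative scores, one per region. The function name `allocate_tokens` and its parameters `budget` and `min_per_region` are hypothetical and chosen only for illustration. The sketch splits a fixed key/value token budget across regions in proportion to their significance, so salient regions keep more tokens while every region retains at least a minimum number of tokens and so still contributes to global attention.

```python
def allocate_tokens(significance, budget, min_per_region=1):
    """Split `budget` tokens across regions in proportion to significance.

    Hypothetical helper for illustration only; not part of SG-Former's code.
    `significance` holds one non-negative score per region.
    """
    n = len(significance)
    if budget < n * min_per_region:
        raise ValueError("budget is too small for the per-region minimum")
    if any(s < 0 for s in significance):
        raise ValueError("significance scores must be non-negative")
    total = sum(significance)
    if total <= 0:
        raise ValueError("at least one significance score must be positive")

    # Tokens left to distribute once each region has its guaranteed minimum.
    spare = budget - n * min_per_region
    raw = [spare * s / total for s in significance]

    # Give every region its minimum plus the integer part of its share.
    alloc = [min_per_region + int(r) for r in raw]

    # Hand out the remaining tokens by largest fractional remainder.
    leftover = budget - sum(alloc)
    order = sorted(range(n), key=lambda i: raw[i] - int(raw[i]), reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


# Four regions, the first being the most salient, sharing 16 tokens.
allocation = allocate_tokens([6, 2, 1, 1], budget=16)
print(allocation)
```

Because the total budget is fixed, the cost of attending to the reallocated tokens does not depend on how significance is distributed; only the granularity per region changes.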