Recently, Transformers have shown strong performance in several vision tasks owing to their powerful modeling capabilities. To reduce the quadratic complexity of self-attention, some notable works restrict attention to local regions or extend it through axial interactions. However, these methods often lack interaction between local and global information and therefore struggle to balance coarse-grained and fine-grained information. To address this problem, we propose AxWin Attention, which models context information in both local windows and axial views. Based on AxWin Attention, we develop a context-aware vision transformer backbone, named AxWin Transformer, which outperforms state-of-the-art methods in classification and in downstream segmentation and detection tasks.
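To make the complexity argument concrete, the following minimal sketch counts query-key pairs under full attention, local-window attention, and axial (same row or column) attention on a single feature map. The 56×56 map size, the window size of 7, and the helper names are illustrative assumptions, and the union pattern at the end only shows one way local and axial context could be combined; it is not the AxWin Attention design, whose details are not given here.

```python
def window_keys(i, j, H, W, w):
    """Keys visible to query (i, j) under non-overlapping w x w windows."""
    r0, c0 = (i // w) * w, (j // w) * w
    return {(r, c)
            for r in range(r0, min(r0 + w, H))
            for c in range(c0, min(c0 + w, W))}


def axial_keys(i, j, H, W):
    """Keys in the same row or the same column as query (i, j)."""
    return {(i, c) for c in range(W)} | {(r, j) for r in range(H)}


def attended_pairs(H, W, key_fn):
    """Total number of (query, key) pairs for a given attention pattern."""
    return sum(len(key_fn(i, j)) for i in range(H) for j in range(W))


# Assumed sizes, chosen only for illustration.
H = W = 56
w = 7

full = (H * W) ** 2  # every token attends to every token
window = attended_pairs(H, W, lambda i, j: window_keys(i, j, H, W, w))
axial = attended_pairs(H, W, lambda i, j: axial_keys(i, j, H, W))
# Hypothetical combination: each query sees its window plus its row and column.
combined = attended_pairs(
    H, W, lambda i, j: window_keys(i, j, H, W, w) | axial_keys(i, j, H, W)
)

for name, pairs in [("full", full), ("window", window),
                    ("axial", axial), ("window+axial", combined)]:
    print(f"{name:>12}: {pairs:>10} pairs ({pairs / full:.2%} of full)")
```

Under these assumptions, window attention grows as H·W·w² and axial attention as H·W·(H+W−1), both far below the (H·W)² cost of full attention, while the combined pattern remains well under that quadratic cost.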