Attention within windows has been widely explored in vision transformers to balance performance, computational complexity, and memory footprint. However, current models adopt a hand-crafted, fixed-size window design, which restricts their capacity to model long-term dependencies and to adapt to objects of different sizes. To address this drawback, we propose \textbf{V}aried-\textbf{S}ize Window \textbf{A}ttention (VSA) to learn adaptive window configurations from data. Specifically, based on the tokens within each default window, VSA employs a window regression module to predict the size and location of the target window, i.e., the attention area where the key and value tokens are sampled. Applied independently to each attention head, VSA can model long-term dependencies, capture rich context from diverse windows, and promote information exchange among overlapping windows. VSA is an easy-to-implement module that can replace the window attention in representative state-of-the-art models with minor modifications and negligible extra computational cost while improving their performance by a large margin, e.g., 1.1\% for Swin-T on ImageNet classification. In addition, the performance gain increases when larger images are used for training and testing. Experimental results on more downstream tasks, including object detection, instance segmentation, and semantic segmentation, further demonstrate the superiority of VSA over vanilla window attention in dealing with objects of different sizes. The code will be released at https://github.com/ViTAE-Transformer/ViTAE-VSA.
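The target-window idea described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The abstract does not state how the predicted size and location are parameterized or how tokens are sampled, so the sketch assumes a per-window scale applied about the window centre, an offset of that centre, a uniform grid of sampling points, and bilinear interpolation. The function names `target_window`, `sampling_grid`, and `bilinear` are hypothetical.

```python
# Illustrative sketch only. Assumptions (not from the abstract):
#   - the regression module outputs a scale (sx, sy) and a centre offset (ox, oy),
#   - key/value tokens are read on a uniform m x m grid inside the target window,
#   - off-grid positions are read with bilinear interpolation.
# Coordinates are continuous: token (row r, col c) covers [c, c+1) x [r, r+1).


def target_window(default_box, scale, offset):
    """Scale a default window about its centre, then shift the centre."""
    x0, y0, x1, y1 = default_box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    half_w = (x1 - x0) / 2.0 * scale[0]
    half_h = (y1 - y0) / 2.0 * scale[1]
    cx, cy = cx + offset[0], cy + offset[1]
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


def sampling_grid(box, m):
    """Return m*m (x, y) points at the centres of an m x m partition of the box."""
    x0, y0, x1, y1 = box
    step_x, step_y = (x1 - x0) / m, (y1 - y0) / m
    return [(x0 + (j + 0.5) * step_x, y0 + (i + 0.5) * step_y)
            for i in range(m) for j in range(m)]


def bilinear(feat, x, y):
    """Bilinearly interpolate feat[row][col] at continuous (x, y), clamped to the border."""
    h, w = len(feat), len(feat[0])
    # Shift by 0.5 so that a token centre maps to its integer index.
    x = min(max(x - 0.5, 0.0), w - 1.0)
    y = min(max(y - 0.5, 0.0), h - 1.0)
    xa, ya = int(x), int(y)
    xb, yb = min(xa + 1, w - 1), min(ya + 1, h - 1)
    tx, ty = x - xa, y - ya
    top = feat[ya][xa] * (1 - tx) + feat[ya][xb] * tx
    bottom = feat[yb][xa] * (1 - tx) + feat[yb][xb] * tx
    return top * (1 - ty) + bottom * ty


if __name__ == "__main__":
    # Toy 8x8 single-channel "key" map and a 4x4 default window in the top-left corner.
    feat = [[r * 8 + c for c in range(8)] for r in range(8)]
    box = target_window((0, 0, 4, 4), scale=(2, 2), offset=(2, 2))
    keys = [bilinear(feat, x, y) for x, y in sampling_grid(box, 4)]
    print(box, len(keys))
```

Under these assumptions, a scale of 1 and an offset of 0 return exactly the tokens of the default window, so fixed-size window attention is the special case. The number of sampled key and value tokens stays at m*m whatever the target window's size, which is one way the extra cost could remain small. In a multi-head setting each head would receive its own scale and offset, while the queries would still come from the default window.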