Vision transformers (ViTs), which model an image as a sequence of partitioned patches, have shown notable performance in diverse vision tasks. Because partitioning an image into patches discards its spatial structure, ViTs use an explicit component called positional embedding to reflect the order of patches. However, we claim that the use of positional embedding does not by itself guarantee the order-awareness of a ViT. To support this claim, we analyze the actual behavior of ViTs using effective receptive fields. We demonstrate that during training, a ViT acquires an understanding of patch order from the positional embedding, which is trained into a specific pattern. Based on this observation, we propose explicitly adding a Gaussian attention bias that guides the positional embedding to have the corresponding pattern from the beginning of training. We evaluated the influence of the Gaussian attention bias on the performance of ViTs in several image classification, object detection, and semantic segmentation experiments. The results showed that the proposed method not only helps ViTs understand images but also boosts their performance on various datasets, including ImageNet, COCO 2017, and ADE20K.
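To make the idea of a Gaussian attention bias concrete, the following is a minimal NumPy sketch, not the authors' implementation. The abstract does not give the exact form of the bias, so the sketch assumes one plausible choice: a bias of $-d_{ij}^2 / (2\sigma^2)$ added to the attention logits before the softmax, where $d_{ij}$ is the Euclidean distance between patches $i$ and $j$ on the patch grid. The function names `gaussian_attention_bias` and `biased_attention`, the parameter `sigma`, and the omission of a class token are all assumptions made for illustration.

```python
import numpy as np


def gaussian_attention_bias(grid_h, grid_w, sigma=1.0):
    """Return an (N, N) additive bias, N = grid_h * grid_w.

    Assumed form (not taken from the paper): -d_ij^2 / (2 * sigma^2),
    where d_ij is the distance between patches i and j on the grid.
    """
    ys, xs = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=-1).astype(np.float64)
    diff = coords[:, None, :] - coords[None, :, :]   # (N, N, 2) via broadcasting
    sq_dist = (diff ** 2).sum(axis=-1)               # squared grid distance
    return -sq_dist / (2.0 * sigma ** 2)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def biased_attention(q, k, v, bias):
    """Single-head scaled dot-product attention with an additive bias."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d) + bias             # bias added before softmax
    weights = softmax(logits, axis=-1)
    return weights @ v, weights


# Usage: a 3x4 patch grid with 8-dimensional queries, keys, and values.
rng = np.random.default_rng(0)
n, dim = 3 * 4, 8
q, k, v = (rng.standard_normal((n, dim)) for _ in range(3))
bias = gaussian_attention_bias(3, 4, sigma=1.5)
out, weights = biased_attention(q, k, v, bias)
```

Under this assumed form, the bias is zero on the diagonal and grows more negative with grid distance, so each patch starts out attending mostly to its spatial neighbors even before any positional pattern has been learned. A smaller `sigma` makes that initial locality sharper.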