In recent years, the Vision Transformer (ViT) has garnered significant attention within the computer vision community. However, Self-Attention, the core component of ViT, lacks explicit spatial priors and has quadratic computational complexity, which limits its applicability. To address these issues, we previously proposed RMT, a robust general-purpose vision backbone with explicit spatial priors. RMT uses a Manhattan distance decay to introduce spatial information and a decomposed attention mechanism, split into horizontal and vertical passes, to model global information. Building on the strengths of RMT, the Euclidean enhanced Vision Transformer (EVT) is an extended version that incorporates two key improvements. First, EVT replaces the Manhattan distance decay with a Euclidean distance decay, which represents spatial relationships more accurately. Second, EVT abandons the decomposed attention mechanism of RMT and instead adopts a simpler spatially-independent grouping approach, which gives the model greater flexibility in controlling the number of tokens within each group. With these modifications, EVT offers a more adaptable way of incorporating spatial priors into the Self-Attention mechanism, overcoming some of the limitations of RMT and broadening its applicability across computer vision tasks. Extensive experiments on Image Classification, Object Detection, Instance Segmentation, and Semantic Segmentation demonstrate that EVT performs strongly. Without additional training data, EVT achieves 86.6% top-1 accuracy on ImageNet-1k.
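The difference between the two distance decays described above can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything here is an assumption for illustration only: the decay is taken to be `gamma ** distance` between token positions on a 2D grid, and the mask is applied element-wise to the softmax attention weights. The names `distance_decay_mask`, `decayed_attention`, and the parameter `gamma` are hypothetical and are not taken from the EVT or RMT code.

```python
import numpy as np


def distance_decay_mask(height, width, gamma=0.9, metric="euclidean"):
    """Return an (N, N) decay mask for N = height * width grid tokens.

    Assumption: the spatial prior has the form gamma ** d(i, j), where d is
    the distance between the grid positions of tokens i and j.
    """
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=-1).astype(np.float64)
    # Pairwise absolute coordinate differences, shape (N, N, 2).
    diff = np.abs(coords[:, None, :] - coords[None, :, :])
    if metric == "manhattan":
        dist = diff.sum(axis=-1)
    elif metric == "euclidean":
        dist = np.sqrt((diff ** 2).sum(axis=-1))
    else:
        raise ValueError(f"unknown metric: {metric!r}")
    return gamma ** dist


def decayed_attention(q, k, v, mask):
    """Single-head attention with the decay mask applied after the softmax.

    Assumption: the mask multiplies the softmax weights element-wise.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return (weights * mask) @ v


# Compare the two priors on a 2 x 3 grid of tokens.
manhattan = distance_decay_mask(2, 3, gamma=0.9, metric="manhattan")
euclidean = distance_decay_mask(2, 3, gamma=0.9, metric="euclidean")

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((6, 4)) for _ in range(3))
out = decayed_attention(q, k, v, euclidean)
```

Under these assumptions, the two masks agree for tokens in the same row or column and differ for diagonal pairs: tokens at (0, 0) and (1, 1) are at Manhattan distance 2 but Euclidean distance √2, so the Euclidean mask attenuates that pair less.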