Vision Transformers (ViTs) have achieved remarkable success in computer vision tasks. However, their potential in rotation-sensitive scenarios remains underexplored, a limitation that may stem from the lack of spatial invariance in the data-forwarding process. In this study, we present a novel approach, termed Spatial Transform Decoupling (STD), which provides a simple yet effective solution for oriented object detection with ViTs. Built upon stacked ViT blocks, STD uses separate network branches to predict the position, size, and angle of bounding boxes, harnessing the spatial transform potential of ViTs in a divide-and-conquer fashion. Moreover, by aggregating cascaded activation masks (CAMs) computed from the regressed parameters, STD gradually enhances features within regions of interest (RoIs), which complements the self-attention mechanism. Without bells and whistles, STD achieves state-of-the-art performance on benchmark datasets, including DOTA-v1.0 (82.24% mAP) and HRSC2016 (98.55% mAP), demonstrating the effectiveness of the proposed method. Source code is available at https://github.com/yuhongtian17/Spatial-Transform-Decoupling.
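The idea of decoupled parameters driving a cascade of activation masks can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `rotated_box_mask` and `cascade_enhance`, the hard binary masks, the `features * (1 + mask)` enhancement rule, and the choice of a full-size box for the position-only stage are all assumptions made for illustration. The sketch only shows how position, size, and angle, treated as three separate inputs, could each tighten the region that gets emphasized.

```python
import numpy as np


def rotated_box_mask(height, width, cx, cy, w, h, angle):
    """Binary mask of grid cells whose centers lie inside a rotated box.

    (cx, cy) is the box center, (w, h) its size, angle is in radians.
    Hypothetical helper, not taken from the STD code base.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    dx = xs + 0.5 - cx  # offsets of cell centers from the box center
    dy = ys + 0.5 - cy
    cos_a, sin_a = np.cos(angle), np.sin(angle)
    # Rotate the offsets into the box's own coordinate frame.
    u = cos_a * dx + sin_a * dy
    v = -sin_a * dx + cos_a * dy
    inside = (np.abs(u) <= w / 2) & (np.abs(v) <= h / 2)
    return inside.astype(np.float32)


def cascade_enhance(features, position, size, angle):
    """Apply three masks in sequence, one per decoupled parameter group.

    features: array of shape (C, H, W).
    position, size, angle: values assumed to come from three separate
    prediction branches (here they are simply passed in).
    The multiplicative rule below is an assumption for illustration.
    """
    _, H, W = features.shape
    cx, cy = position
    w, h = size
    masks = [
        rotated_box_mask(H, W, cx, cy, W, H, 0.0),   # position known only
        rotated_box_mask(H, W, cx, cy, w, h, 0.0),   # position + size
        rotated_box_mask(H, W, cx, cy, w, h, angle), # position + size + angle
    ]
    out = features
    for mask in masks:
        out = out * (1.0 + mask)  # emphasize cells inside the current mask
    return out


if __name__ == "__main__":
    feats = np.ones((1, 8, 8))
    enhanced = cascade_enhance(feats, position=(4, 4), size=(4, 2), angle=0.0)
    print(enhanced[0])
```

In this toy setup, cells that fall inside all three masks are amplified the most, so the emphasized region narrows as each additional parameter group is taken into account. A learned model would presumably use soft, differentiable masks rather than the hard thresholds used here.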