Vision Transformers (ViTs) have achieved remarkable success in computer vision tasks. However, their potential in rotation-sensitive scenarios remains underexplored, a limitation that may stem from the lack of spatial invariance in the data-forwarding process. In this study, we present Spatial Transform Decoupling (STD), a simple yet effective approach to oriented object detection with ViTs. Built upon stacked ViT blocks, STD uses separate network branches to predict the position, size, and angle of bounding boxes, harnessing the spatial transform potential of ViTs in a divide-and-conquer fashion. Moreover, by aggregating cascaded activation masks (CAMs) computed from the regressed parameters, STD gradually enhances features within regions of interest (RoIs), which complements the self-attention mechanism. Without bells and whistles, STD achieves state-of-the-art performance on benchmark datasets including DOTA-v1.0 (82.24% mAP) and HRSC2016 (98.55% mAP), demonstrating the effectiveness of the proposed method. Source code is available at https://github.com/yuhongtian17/Spatial-Transform-Decoupling.
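To make the idea of a mask "computed from the regressed parameters" concrete, the following is a minimal sketch, not the STD implementation. It assumes a binary mask over a feature grid that is 1 inside the oriented box given by position (`cx`, `cy`), size (`w`, `h`), and angle (`theta`), and a simple multiplicative enhancement with a hypothetical `gain` parameter. The function names, the binary mask form, and the enhancement rule are all illustrative assumptions; the abstract does not specify how the CAMs are actually computed or aggregated.

```python
import math


def oriented_box_mask(cx, cy, w, h, theta, grid_h, grid_w):
    """Binary mask on a grid_h x grid_w grid (hypothetical helper).

    A cell is 1.0 when its centre lies inside the box centred at (cx, cy)
    with size (w, h), rotated by theta radians; otherwise it is 0.0.
    """
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    mask = []
    for row in range(grid_h):
        line = []
        for col in range(grid_w):
            # Offset of the cell centre from the box centre.
            dx = (col + 0.5) - cx
            dy = (row + 0.5) - cy
            # Express the offset in the box's own (rotated) frame.
            u = dx * cos_t + dy * sin_t
            v = -dx * sin_t + dy * cos_t
            inside = abs(u) <= w / 2 and abs(v) <= h / 2
            line.append(1.0 if inside else 0.0)
        mask.append(line)
    return mask


def enhance(features, mask, gain=1.0):
    """Scale features inside the mask by (1 + gain); leave the rest unchanged.

    This multiplicative rule is an assumption chosen for illustration.
    """
    return [
        [f * (1.0 + gain * m) for f, m in zip(f_row, m_row)]
        for f_row, m_row in zip(features, mask)
    ]


# Same position and size, two different angles on an 8 x 8 grid.
flat = oriented_box_mask(4, 4, 4, 2, 0.0, 8, 8)
upright = oriented_box_mask(4, 4, 4, 2, math.pi / 2, 8, 8)
boosted = enhance([[1.0] * 8 for _ in range(8)], flat)
```

In this sketch, changing only `theta` moves the highlighted region without touching the position or size arguments, which is the kind of separation between box parameters that the decoupled branches are described as exploiting.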