Salient Object Detection in Optical Remote Sensing Images Driven by Transformer

Existing methods for Salient Object Detection in Optical Remote Sensing Images (ORSI-SOD) mainly adopt Convolutional Neural Networks (CNNs), such as VGG and ResNet, as the backbone. Since CNNs can only extract features within limited receptive fields, most ORSI-SOD methods follow the local-to-contextual paradigm. In this paper, we propose a novel Global Extraction Local Exploration Network (GeleNet) for ORSI-SOD that follows the global-to-local paradigm. Specifically, GeleNet first adopts a transformer backbone to generate four-level feature embeddings with global long-range dependencies. Then, GeleNet employs a Direction-aware Shuffle Weighted Spatial Attention Module (D-SWSAM) and its simplified version (SWSAM) to enhance local interactions, and a Knowledge Transfer Module (KTM) to further enhance cross-level contextual interactions. D-SWSAM perceives the orientation information in the lowest-level features through directional convolutions to adapt to the various orientations of salient objects in ORSIs, and enhances the details of salient objects with an improved attention mechanism. SWSAM discards the direction-aware part of D-SWSAM to focus on localizing salient objects in the highest-level features. KTM models the contextual correlation knowledge of the two middle-level features of different scales based on the self-attention mechanism, and transfers this knowledge to the raw features to generate more discriminative features. Finally, a saliency predictor generates the saliency map from the outputs of the above three modules. Extensive experiments on three public datasets demonstrate that the proposed GeleNet outperforms relevant state-of-the-art methods. The code and results of our method are available at https://github.com/MathLee/GeleNet.
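The data flow described above can be sketched as a small routing table: which of the four feature levels feeds which module, and what the saliency predictor receives. This is only an illustrative sketch, not the released implementation. The stride values, the 256x256 input size, and all names (`ASSUMED_STRIDES`, `Feature`, `backbone_levels`, `route`, `predictor_inputs`) are assumptions introduced here; the abstract specifies only the number of levels and the module-to-level assignment.

```python
from dataclasses import dataclass

# Assumption: a hierarchical transformer backbone that downsamples by these
# factors at its four stages. The abstract does not state the strides.
ASSUMED_STRIDES = (4, 8, 16, 32)


@dataclass(frozen=True)
class Feature:
    level: int   # 1 = lowest level (finest), 4 = highest level (coarsest)
    height: int
    width: int


def backbone_levels(height, width, strides=ASSUMED_STRIDES):
    """Return the spatial size of each of the four feature levels."""
    return [Feature(i + 1, height // s, width // s)
            for i, s in enumerate(strides)]


def route(features):
    """Map each module named in the abstract to the levels it consumes."""
    f1, f2, f3, f4 = features
    return {
        "D-SWSAM": [f1],   # lowest level: orientation and detail enhancement
        "KTM": [f2, f3],   # two middle levels: cross-level context
        "SWSAM": [f4],     # highest level: salient object localization
    }


def predictor_inputs(routing):
    """The saliency predictor takes one output per module, three in total."""
    return list(routing.keys())


if __name__ == "__main__":
    # Hypothetical 256x256 input, chosen only for illustration.
    routing = route(backbone_levels(256, 256))
    for module, inputs in routing.items():
        sizes = [(f.level, f.height, f.width) for f in inputs]
        print(module, sizes)
    print("predictor receives:", predictor_inputs(routing))
```

Under these assumed strides, the lowest-level map keeps the most spatial detail, which is consistent with assigning it to the detail-oriented D-SWSAM, while the coarsest map goes to SWSAM for localization.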
