Beyond Global Scanning: Adaptive Visual State Space Modeling for Salient Object Detection in Optical Remote Sensing Images

Salient object detection (SOD) in optical remote sensing images (ORSIs) faces numerous challenges, including large variations in target scale and low contrast between targets and background. Existing methods built on vision transformer (ViT) and convolutional neural network (CNN) architectures aim to exploit both global and local features, but the difficulty of effectively integrating these heterogeneous features limits their overall performance. To overcome these limitations, we propose an adaptive state space context network (ASCNet), which builds on the state space model mechanism to simultaneously capture long-range dependencies and enhance regional feature representation. Specifically, we employ a visual state space encoder to extract multi-scale features. To achieve deep guidance and enhancement of these features, we design a Multi-Level Context Module (MLCM), which strengthens cross-layer interaction between features of different scales and improves the model's structural perception, allowing it to distinguish foreground from background more effectively. We then design the Adaptive Patchwise Visual State Space (APVSS) block as the decoder of ASCNet; it integrates our proposed Dynamic Adaptive Granularity Scan (DAGS) and Granularity-aware Propagation Module (GPM). The APVSS block performs adaptive patch scanning on feature maps enhanced by local perception, thereby capturing rich local region information and strengthening the local modeling capability of the state space model. Extensive experimental results demonstrate that the proposed model achieves state-of-the-art performance, validating its effectiveness and superiority.
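The abstract does not give the equations for APVSS or DAGS, so the following is only a generic sketch of why patch-wise scanning can help a state space model with local modeling. It assumes a scalar discretized linear recurrence and a fixed patch size; all function names (`raster_scan_order`, `patchwise_scan_order`, `ssm_scan`, `scan_feature_map`, `mean_vertical_gap`) are hypothetical and this is not ASCNet's implementation. The idea it illustrates: a 2D feature map must be flattened into a 1D sequence before the recurrence runs, and the scan order decides how far apart spatial neighbours end up in that sequence.

```python
from typing import List

Grid = List[List[float]]


def raster_scan_order(h: int, w: int) -> List[int]:
    """Global row-major scan: flat indices visited left-to-right, top-to-bottom."""
    return [r * w + c for r in range(h) for c in range(w)]


def patchwise_scan_order(h: int, w: int, patch: int) -> List[int]:
    """Visit the map patch by patch (row-major over patches), row-major inside each patch.

    Hypothetical simplification: the patch size is fixed. An adaptive scheme such
    as DAGS would choose the granularity dynamically, which this sketch does not do.
    """
    if h % patch or w % patch:
        raise ValueError("h and w must be divisible by patch")
    order = []
    for pr in range(0, h, patch):
        for pc in range(0, w, patch):
            for r in range(pr, pr + patch):
                for c in range(pc, pc + patch):
                    order.append(r * w + c)
    return order


def ssm_scan(x: List[float], a: float, b: float, c: float) -> List[float]:
    """Scalar discretized linear SSM (assumed form): h_t = a*h_{t-1} + b*x_t, y_t = c*h_t."""
    state = 0.0
    ys = []
    for xt in x:
        state = a * state + b * xt
        ys.append(c * state)
    return ys


def scan_feature_map(fmap: Grid, order: List[int], a: float, b: float, c: float) -> Grid:
    """Flatten `fmap` in the given scan order, run the SSM, and scatter outputs back."""
    h, w = len(fmap), len(fmap[0])
    flat = [v for row in fmap for v in row]
    ys = ssm_scan([flat[i] for i in order], a, b, c)
    out = [0.0] * (h * w)
    for pos, idx in enumerate(order):
        out[idx] = ys[pos]  # each output returns to its original pixel
    return [out[r * w:(r + 1) * w] for r in range(h)]


def mean_vertical_gap(order: List[int], h: int, w: int) -> float:
    """Average sequence distance between each pixel and the pixel directly below it."""
    pos = [0] * (h * w)
    for p, idx in enumerate(order):
        pos[idx] = p
    gaps = [abs(pos[(r + 1) * w + c] - pos[r * w + c])
            for r in range(h - 1) for c in range(w)]
    return sum(gaps) / len(gaps)


if __name__ == "__main__":
    h = w = 4
    print(mean_vertical_gap(raster_scan_order(h, w), h, w))        # 4.0
    print(mean_vertical_gap(patchwise_scan_order(h, w, 2), h, w))  # about 3.33
```

With |a| < 1, the contribution of input x_s to output y_t in this recurrence is c * a^(t-s) * b, so interactions decay with sequence distance. A row-major scan places vertically adjacent pixels w steps apart; patch-wise ordering shrinks that gap to the patch width for pairs inside the same patch (2 instead of 4 in the 4x4 example), at the cost of larger gaps across patch boundaries.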
