High-resolution remotely sensed images pose a challenge for commonly used semantic segmentation methods such as Convolutional Neural Network (CNN) and Vision Transformer (ViT). CNN-based methods struggle with handling such high-resolution images due to their limited receptive field, while ViT faces challenges in handling long sequences. Inspired by Mamba, which adopts a State Space Model (SSM) to efficiently capture global semantic information, we propose a semantic segmentation framework for high-resolution remotely sensed images, named Samba. Samba utilizes an encoder-decoder architecture, with Samba blocks serving as the encoder for efficient multi-level semantic information extraction, and UperNet functioning as the decoder. We evaluate Samba on the LoveDA, ISPRS Vaihingen, and ISPRS Potsdam datasets, comparing its performance against top-performing CNN and ViT methods. The results reveal that Samba achieved unparalleled performance on commonly used remote sensing datasets for semantic segmentation. Our proposed Samba demonstrates for the first time the effectiveness of SSM in semantic segmentation of remotely sensed images, setting a new benchmark in performance for Mamba-based techniques in this specific application. The source code and baseline implementations are available at https://github.com/zhuqinfeng1999/Samba.
翻译:高分辨率遥感图像对常用的语义分割方法(如卷积神经网络CNN和视觉Transformer ViT)提出了挑战。基于CNN的方法因其有限的感受野而难以处理此类高分辨率图像,而ViT则在处理长序列时面临困难。受采用状态空间模型(SSM)高效捕获全局语义信息的Mamba启发,我们提出了一种面向高分辨率遥感图像的语义分割框架,命名为Samba。Samba采用编码器-解码器架构,其中Samba块作为编码器以高效提取多级语义信息,UperNet则作为解码器。我们在LoveDA、ISPRS Vaihingen和ISPRS Potsdam数据集上评估了Samba,并将其与性能最优的CNN和ViT方法进行了比较。结果表明,Samba在常用的遥感语义分割数据集上取得了无与伦比的性能。我们提出的Samba首次证明了SSM在遥感图像语义分割中的有效性,为基于Mamba的技术在此特定应用中树立了新的性能基准。源代码和基线实现可在https://github.com/zhuqinfeng1999/Samba获取。