Owing to large image sizes and object variations, current CNN-based and Transformer-based approaches to remote sensing image semantic segmentation either capture long-range dependencies poorly or incur high computational complexity. In this paper, we propose CM-UNet, which comprises a CNN-based encoder that extracts local image features and a Mamba-based decoder that aggregates and integrates global information, enabling efficient semantic segmentation of remote sensing images. Specifically, we introduce a CSMamba block as the core of the segmentation decoder; it uses channel and spatial attention as the gate activation condition of the vanilla Mamba to enhance feature interaction and global-local information fusion. Moreover, to further refine the output features of the CNN encoder, a Multi-Scale Attention Aggregation (MSAA) module merges features from different scales. By integrating the CSMamba block and the MSAA module, CM-UNet effectively captures the long-range dependencies and multi-scale global contextual information of large-scale remote sensing images. Experimental results on three benchmarks indicate that CM-UNet outperforms existing methods across various performance metrics. The code is available at https://github.com/XiaoBuL/CM-UNet.
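The gating idea behind the CSMamba block can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`channel_attention`, `spatial_attention`, `toy_scan`, `gated_block`) are hypothetical, the attention maps are assumed to be parameter-free poolings followed by a sigmoid, and a simple causal linear recurrence stands in for Mamba's selective state-space scan. It only shows how channel and spatial attention might jointly act as the gate on a sequence-mixing branch.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def channel_attention(x):
    """Per-channel weights in (0, 1) from global average pooling; x is (C, H, W)."""
    pooled = x.mean(axis=(1, 2))            # shape (C,)
    return sigmoid(pooled)[:, None, None]   # shape (C, 1, 1)


def spatial_attention(x):
    """Per-pixel weights in (0, 1) from the channel-wise mean; x is (C, H, W)."""
    pooled = x.mean(axis=0)                 # shape (H, W)
    return sigmoid(pooled)[None, :, :]      # shape (1, H, W)


def toy_scan(x, decay=0.9):
    """Assumed stand-in for the state-space scan: a causal linear
    recurrence over the flattened pixel sequence of each channel."""
    c, h, w = x.shape
    seq = x.reshape(c, h * w)
    out = np.empty_like(seq)
    state = np.zeros(c)
    for t in range(h * w):
        state = decay * state + (1.0 - decay) * seq[:, t]
        out[:, t] = state
    return out.reshape(c, h, w)


def gated_block(x, decay=0.9):
    """Multiply the scanned (global) branch by an attention-derived gate."""
    gate = channel_attention(x) * spatial_attention(x)  # broadcasts to (C, H, W)
    return toy_scan(x, decay) * gate


rng = np.random.default_rng(0)
features = rng.standard_normal((4, 8, 8))   # hypothetical (C, H, W) feature map
out = gated_block(features)
print(out.shape)  # (4, 8, 8)
```

In vanilla Mamba the gate comes from an activated linear projection of the input. Here it is replaced by the product of the two attention maps, so each output value is scaled by how salient its channel and its spatial location appear to be.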