Deep learning methods, especially Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), are frequently employed for semantic segmentation of high-resolution remotely sensed images. However, CNNs are constrained by their restricted receptive fields, while ViTs face challenges due to their quadratic complexity. Recently, the Mamba model, featuring linear complexity and a global receptive field, has gained extensive attention for vision tasks. In such tasks, images must be serialized into sequences compatible with the Mamba model. Numerous research efforts have explored scanning strategies for serializing images, aiming to enhance the Mamba model's understanding of them. However, the effectiveness of these scanning strategies remains uncertain. In this research, we conduct a comprehensive experimental investigation into the impact of mainstream scanning directions and their combinations on semantic segmentation of remotely sensed images. Through extensive experiments on the LoveDA, ISPRS Potsdam, and ISPRS Vaihingen datasets, we demonstrate that no scanning strategy consistently outperforms the others, regardless of its complexity or the number of scanning directions involved. A simple, single scanning direction is deemed sufficient for semantic segmentation of high-resolution remotely sensed images. Relevant directions for future research are also recommended.
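To make the notion of a scanning direction concrete, the following minimal sketch (not the paper's implementation; the function and direction names are illustrative) shows how a 2D grid of patch embeddings can be flattened into a 1D token sequence along several common scan orders, such as row-major, column-major, and their reversals:

```python
import numpy as np

def serialize(patches: np.ndarray, direction: str) -> np.ndarray:
    """Flatten an (H, W, C) grid of patch embeddings into an (H*W, C) sequence."""
    h, w, c = patches.shape
    if direction == "horizontal":        # row-major raster scan
        return patches.reshape(h * w, c)
    if direction == "vertical":          # column-major scan
        return patches.transpose(1, 0, 2).reshape(h * w, c)
    if direction == "horizontal_flip":   # row-major, traversed in reverse
        return patches.reshape(h * w, c)[::-1]
    if direction == "vertical_flip":     # column-major, traversed in reverse
        return patches.transpose(1, 0, 2).reshape(h * w, c)[::-1]
    raise ValueError(f"unknown direction: {direction}")

# A 2x3 grid of 1-dimensional "patches" labeled 0..5 for illustration
grid = np.arange(6).reshape(2, 3, 1)
print(serialize(grid, "horizontal").ravel())  # [0 1 2 3 4 5]
print(serialize(grid, "vertical").ravel())    # [0 3 1 4 2 5]
```

Multi-directional strategies in the literature typically run the state-space model over several such sequences and merge the results; the experiments summarized above suggest this added complexity yields no consistent accuracy gain over a single direction.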