Geometric information in normalized digital surface models (nDSMs) is highly correlated with the semantic class of the land cover. Jointly exploiting the two modalities, RGB and nDSM (height), therefore has great potential to improve segmentation performance. However, this direction remains under-explored in remote sensing because of three challenges. First, existing datasets are relatively small and limited in diversity, which restricts how thoroughly methods can be validated. Second, there is no unified benchmark for performance assessment, which makes it difficult to compare the effectiveness of different models. Third, sophisticated multi-modal semantic segmentation methods have not been deeply explored for remote sensing data. To address these challenges, we introduce a new remote sensing benchmark for multi-modal semantic segmentation based on RGB-Height (RGB-H) data. To support a fair and comprehensive analysis of existing methods, the proposed benchmark consists of 1) a large-scale dataset of co-registered RGB and nDSM pairs with pixel-wise semantic labels; and 2) a comprehensive evaluation and analysis of existing multi-modal fusion strategies for both convolutional and Transformer-based networks on remote sensing data. Furthermore, we propose a novel and effective Transformer-based intermediary multi-modal fusion (TIMF) module that improves semantic segmentation performance through adaptive token-level multi-modal fusion. Extensive analyses of these methods are conducted, and the experimental results provide valuable insights. The designed benchmark can foster future research on new methods for multi-modal learning on remote sensing data. Code for the benchmark and baselines can be accessed at \url{https://github.com/EarthNets/RSI-MMSegmentation}.