CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers

Source: arXiv. Accepted to IEEE Transactions on Intelligent Transportation Systems (T-ITS). The source code of CMX is publicly available at https://github.com/huaaaliu/RGBX_Semantic_Segmentation

Scene understanding based on image segmentation is a crucial component of autonomous vehicles. Pixel-wise semantic segmentation of RGB images can be advanced by exploiting complementary features from a supplementary modality (X-modality). However, covering a wide variety of sensors with a modality-agnostic model remains an unresolved problem due to variations in sensor characteristics among different modalities. Unlike previous modality-specific methods, in this work, we propose a unified fusion framework, CMX, for RGB-X semantic segmentation. To generalize well across different modalities, which often include supplements as well as uncertainties, a unified cross-modal interaction is crucial for modality fusion. Specifically, we design a Cross-Modal Feature Rectification Module (CM-FRM) to calibrate bi-modal features by leveraging the features from one modality to rectify the features of the other modality. With rectified feature pairs, we deploy a Feature Fusion Module (FFM) to perform sufficient exchange of long-range contexts before mixing. To verify CMX, for the first time, we unify five modalities complementary to RGB, i.e., depth, thermal, polarization, event, and LiDAR. Extensive experiments show that CMX generalizes well to diverse multi-modal fusion, achieving state-of-the-art performance on five RGB-Depth benchmarks, as well as RGB-Thermal, RGB-Polarization, and RGB-LiDAR datasets. Besides, to investigate the generalizability to dense-sparse data fusion, we establish an RGB-Event semantic segmentation benchmark based on the EventScape dataset, on which CMX sets the new state-of-the-art.
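
The abstract names two stages, rectification (CM-FRM) and fusion (FFM), without giving their internals. The sketch below is one plausible reading of that two-stage idea in NumPy, not the authors' implementation (see the repository linked above for that). Everything in it is an assumption made for illustration: the function names `rectify_pair`, `cross_attend`, and `fuse`, the weight matrices `w_rgb`, `w_x`, and `w_mix`, the mixing factor `lam`, and the choice of channel-wise gating and plain scaled dot-product cross-attention.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def rectify_pair(rgb, x, w_rgb, w_x, lam=0.5):
    """Toy rectification step (hypothetical, channel-wise gating only).

    rgb, x     : feature maps of shape (C, H, W), one per modality
    w_rgb, w_x : assumed projection matrices of shape (C, 2C)
    lam        : assumed weight of the cross-modal correction
    """
    # Joint descriptor: per-channel spatial means of both modalities, shape (2C,)
    desc = np.concatenate([rgb.mean(axis=(1, 2)), x.mean(axis=(1, 2))])
    gate_rgb = sigmoid(w_rgb @ desc)  # (C,) gates applied to the RGB features
    gate_x = sigmoid(w_x @ desc)      # (C,) gates applied to the X features
    # Each modality is corrected with gated features of the OTHER modality.
    rgb_out = rgb + lam * gate_x[:, None, None] * x
    x_out = x + lam * gate_rgb[:, None, None] * rgb
    return rgb_out, x_out


def cross_attend(queries, context):
    """Scaled dot-product cross-attention without learned projections.

    queries, context: token matrices of shape (N, C).
    """
    scores = queries @ context.T / np.sqrt(queries.shape[1])  # (N, N)
    return softmax(scores, axis=-1) @ context                 # (N, C)


def fuse(rgb, x, w_mix):
    """Toy fusion step: exchange long-range context, then mix.

    w_mix: assumed mixing matrix of shape (2C, C).
    """
    c, h, w = rgb.shape
    rgb_tok = rgb.reshape(c, h * w).T  # (N, C) with N = H * W
    x_tok = x.reshape(c, h * w).T
    # Every position in one modality attends to all positions in the other.
    rgb_ctx = rgb_tok + cross_attend(rgb_tok, x_tok)
    x_ctx = x_tok + cross_attend(x_tok, rgb_tok)
    # Mixing: concatenate along channels and project back to C channels.
    merged = np.concatenate([rgb_ctx, x_ctx], axis=1) @ w_mix  # (N, C)
    return merged.T.reshape(c, h, w)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    C, H, W = 4, 3, 5
    rgb_feat = rng.standard_normal((C, H, W))
    x_feat = rng.standard_normal((C, H, W))  # e.g. depth or thermal features
    rgb_r, x_r = rectify_pair(
        rgb_feat, x_feat,
        rng.standard_normal((C, 2 * C)),
        rng.standard_normal((C, 2 * C)),
    )
    fused = fuse(rgb_r, x_r, rng.standard_normal((2 * C, C)))
    print(fused.shape)
```

The ordering mirrors the abstract: features are first rectified as a pair, and only the rectified pair is passed to the fusion step, which exchanges context across all spatial positions before the two streams are merged into a single feature map. In a trained model the weight matrices would be learned; here they are random placeholders.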
