CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers

Source: arXiv. Accepted to IEEE Transactions on Intelligent Transportation Systems (T-ITS). The source code of CMX is publicly available at https://github.com/huaaaliu/RGBX_Semantic_Segmentation

Scene understanding based on image segmentation is a crucial component of autonomous vehicles. Pixel-wise semantic segmentation of RGB images can be advanced by exploiting complementary features from a supplementary modality (X-modality). However, covering a wide variety of sensors with a modality-agnostic model remains an unresolved problem, because sensor characteristics vary across modalities. Unlike previous modality-specific methods, in this work, we propose a unified fusion framework, CMX, for RGB-X semantic segmentation. To generalize well across different modalities, which often carry supplementary information as well as uncertainties, a unified cross-modal interaction is crucial for modality fusion. Specifically, we design a Cross-Modal Feature Rectification Module (CM-FRM) to calibrate bi-modal features by leveraging the features from one modality to rectify the features of the other modality. With rectified feature pairs, we deploy a Feature Fusion Module (FFM) to perform sufficient exchange of long-range contexts before mixing. To verify CMX, for the first time, we unify five modalities complementary to RGB, i.e., depth, thermal, polarization, event, and LiDAR. Extensive experiments show that CMX generalizes well to diverse multi-modal fusion, achieving state-of-the-art performance on five RGB-Depth benchmarks, as well as on RGB-Thermal, RGB-Polarization, and RGB-LiDAR datasets. In addition, to investigate the generalizability to dense-sparse data fusion, we establish an RGB-Event semantic segmentation benchmark based on the EventScape dataset, on which CMX sets a new state of the art.
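To make the rectify-then-fuse idea more concrete, here is a minimal toy sketch in plain Python. It is not the CMX implementation: the abstract does not specify the operations inside CM-FRM or FFM, so the channel-wise sigmoid gating, the `strength` parameter, the helper names (`channel_gates`, `rectify_pair`, `fuse`), and the averaging used for mixing are all assumptions made for illustration. The exchange of long-range contexts that FFM performs before mixing is not modeled here.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def channel_gates(feat):
    """One gate in (0, 1) per channel, from that channel's global mean.

    Hypothetical stand-in for a learned channel-attention step.
    """
    return [sigmoid(sum(ch) / len(ch)) for ch in feat]

def rectify_pair(rgb, x, strength=0.5):
    """Rectify each modality by adding gated features from the other one.

    `rgb` and `x` are lists of channels; each channel is a flat list of
    values at the same spatial positions. `strength` is an assumed
    hyperparameter controlling how much of the other modality is mixed in.
    """
    g_rgb, g_x = channel_gates(rgb), channel_gates(x)
    rgb_out = [[r + strength * g * v for r, v in zip(rc, xc)]
               for rc, xc, g in zip(rgb, x, g_x)]
    x_out = [[v + strength * g * r for r, v in zip(rc, xc)]
             for rc, xc, g in zip(rgb, x, g_rgb)]
    return rgb_out, x_out

def fuse(rgb, x):
    """Placeholder mixing step: element-wise average of the rectified pair."""
    return [[0.5 * (r + v) for r, v in zip(rc, xc)]
            for rc, xc in zip(rgb, x)]

# Toy features: 2 channels, 3 flattened spatial positions each.
rgb_feat = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]]
x_feat = [[0.0, 0.0, 0.0], [4.0, 4.0, 4.0]]

rgb_rect, x_rect = rectify_pair(rgb_feat, x_feat)
fused = fuse(rgb_rect, x_rect)
```

In this toy input each modality has one informative channel and one empty channel. After rectification, the empty RGB channel picks up a gated copy of the X channel and vice versa, which is the intuition behind using one modality to calibrate the other. A learned module would replace the fixed gates with trained weights.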
