TUNI: Unifying Pre-training and Fine-tuning with Modality-Aware Mutual Learning and Rectification for RGB-T Semantic Segmentation

From arXiv. This paper is an extended version of the authors' work previously presented at the ICRA conference. To appear in IEEE Transactions on Circuits and Systems for Video Technology. DOI: 10.1109/TCSVT.2026.3701706

RGB-thermal (RGB-T) semantic segmentation improves the environmental perception of autonomous platforms in challenging conditions. Prevailing RGB-T segmentation frameworks suffer from suboptimal multi-modal feature extraction and fusion, unbalanced modality dependency, and inadequate utilization of thermal information. To address these challenges, we propose TUNI, a unified pre-training and fine-tuning framework for efficient, real-time RGB-T semantic segmentation. TUNI pre-trains an RGB-T encoder incorporating an RGB-T local module that selectively emphasizes salient consistent and distinct local features across modalities, thereby integrating cross-modal feature extraction and fusion in a unified manner. To alleviate modality bias during RGB-T pre-training, modality-inverted contrastive mutual learning is introduced to enable knowledge exchange between an RGB-dominated encoder and a thermal-dominated encoder. In the fine-tuning phase, modality rectification learning fully exploits residual thermal information by focusing on regions where the predictions of two modality-specific decoders are correct yet divergent. We further develop three TUNI variants, covering lightweight, balanced, and high-performance requirements. Extensive experiments on five RGB-T semantic segmentation datasets demonstrate that TUNI achieves superior accuracy, generalization, and compactness compared with 15 state-of-the-art models. The code is available at https://github.com/xiaodonguo/TUNI-v2.
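The phrase "correct yet divergent prediction regions" can be made concrete with a small mask computation. The sketch below is only one possible reading, not the authors' implementation: it assumes the regions of interest are pixels where the thermal decoder is correct and the RGB decoder predicts a different class. The function name `rectification_mask`, its arguments, and the `ignore_index` value of 255 are all hypothetical choices made for illustration.

```python
import numpy as np


def rectification_mask(pred_rgb, pred_thermal, label, ignore_index=255):
    """Return a boolean mask of 'correct yet divergent' pixels.

    Assumption (not from the paper): a pixel qualifies when the thermal
    decoder predicts the ground-truth class and the RGB decoder predicts
    a different class. Pixels labelled `ignore_index` never qualify.
    """
    pred_rgb = np.asarray(pred_rgb)
    pred_thermal = np.asarray(pred_thermal)
    label = np.asarray(label)

    valid = label != ignore_index            # pixels that carry a usable label
    divergent = pred_rgb != pred_thermal     # the two decoders disagree
    thermal_correct = pred_thermal == label  # thermal decoder matches the label
    return valid & divergent & thermal_correct


# Toy 2x2 class maps (per-pixel argmax outputs), hypothetical values.
label = np.array([[0, 1], [2, 255]])
pred_rgb = np.array([[0, 0], [2, 1]])
pred_thermal = np.array([[0, 1], [1, 1]])

mask = rectification_mask(pred_rgb, pred_thermal, label)
```

Here only the top-right pixel is selected: the decoders disagree there and the thermal prediction matches the label. Such a mask could be used to reweight a per-pixel loss, although how TUNI actually uses these regions is not specified in the abstract.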
