RGB-T saliency detection has emerged as an important computer vision task that identifies conspicuous objects in challenging scenes such as dark environments. However, existing methods neglect the characteristics of cross-modal features and rely solely on network structures to fuse RGB and thermal features. To address this, we first propose a Multi-Modal Hybrid loss (MMHL) that comprises supervised and self-supervised loss functions. The supervised loss component of MMHL makes distinct use of semantic features from the different modalities, while the self-supervised loss component reduces the distance between RGB and thermal features. We further consider both spatial and channel information during feature fusion and propose the Hybrid Fusion Module to effectively fuse RGB and thermal features. Lastly, instead of jointly training the network with cross-modal features, we implement a sequential training strategy that trains only on RGB images in the first stage and then learns cross-modal features in the second stage. This training strategy improves saliency detection performance without additional computational overhead. Performance evaluations and ablation studies show that the proposed method outperforms existing state-of-the-art methods.
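To make the structure of the hybrid loss concrete, the following is a minimal sketch of how a supervised term and a self-supervised term might be combined. The abstract does not give the exact form of MMHL, so everything here is an assumption made for illustration: binary cross-entropy as the supervised term applied to each modality's prediction separately, mean squared distance between RGB and thermal feature vectors as the self-supervised term, and the names `bce`, `feature_distance`, `hybrid_loss`, and `weight`. Pure Python is used only to keep the example self-contained.

```python
import math


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(pred)


def feature_distance(feat_rgb, feat_thermal):
    """Mean squared distance between two equal-length feature vectors."""
    squared = [(a - b) ** 2 for a, b in zip(feat_rgb, feat_thermal)]
    return sum(squared) / len(squared)


def hybrid_loss(pred_rgb, pred_thermal, target, feat_rgb, feat_thermal, weight=0.5):
    """Assumed hybrid loss: supervised term plus weighted self-supervised term."""
    # Supervised part (assumption): each modality's saliency prediction is
    # compared with the ground-truth mask on its own.
    supervised = bce(pred_rgb, target) + bce(pred_thermal, target)
    # Self-supervised part (assumption): needs no labels, and shrinks as the
    # RGB and thermal features move closer together.
    self_supervised = feature_distance(feat_rgb, feat_thermal)
    return supervised + weight * self_supervised


# Toy usage with flattened 4-pixel masks and 3-dimensional features.
target = [1.0, 0.0, 1.0, 0.0]
pred_rgb = [0.9, 0.2, 0.7, 0.1]
pred_thermal = [0.8, 0.1, 0.6, 0.3]
feat_rgb = [0.5, -1.0, 2.0]
feat_thermal = [0.4, -0.8, 1.5]
loss = hybrid_loss(pred_rgb, pred_thermal, target, feat_rgb, feat_thermal)
print(f"hybrid loss: {loss:.4f}")
```

In this sketch the self-supervised term vanishes when the two feature vectors coincide, so minimizing the total pulls the modalities together while the supervised term keeps both predictions tied to the ground truth. The actual MMHL may use different distance measures, weights, or feature levels.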