RGB-Thermal Salient Object Detection aims to pinpoint prominent objects within aligned pairs of visible and thermal infrared images. Traditional encoder-decoder architectures, though designed for cross-modality feature interaction, may not adequately account for noise originating from defective modalities. Inspired by the hierarchical nature of the human visual system, we propose ConTriNet, a robust Confluent Triple-Flow Network employing a Divide-and-Conquer strategy. Specifically, ConTriNet comprises three flows: two modality-specific flows explore cues from the RGB and Thermal modalities, respectively, while a third modality-complementary flow integrates cues from both. ConTriNet offers several notable advantages. It incorporates a Modality-induced Feature Modulator in the modality-shared union encoder to minimize inter-modality discrepancies and mitigate the impact of defective samples. Additionally, a foundational Residual Atrous Spatial Pyramid Module in the separated flows enlarges the receptive field, allowing the capture of multi-scale contextual information. Furthermore, a Modality-aware Dynamic Aggregation Module in the modality-complementary flow dynamically aggregates saliency-related cues from both modality-specific flows. Leveraging the proposed parallel triple-flow framework, we further refine the saliency maps derived from the different flows through a flow-cooperative fusion strategy, yielding a high-quality, full-resolution saliency map for the final prediction. To evaluate the robustness and stability of our approach, we collect a comprehensive RGB-T SOD benchmark, VT-IMAG, covering various real-world challenging scenarios. Extensive experiments on public benchmarks and our VT-IMAG dataset demonstrate that ConTriNet consistently outperforms state-of-the-art competitors in both common and challenging scenarios.
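At a high level, the triple-flow pipeline described above can be sketched as follows. This is a minimal illustrative sketch only: the function names mirror the abstract's terminology, but all tensor shapes, the channel-mean "saliency" proxy, the softmax weighting inside the aggregation step, and the averaging fusion rule are assumptions for exposition, not the paper's learned modules.

```python
import numpy as np

def modality_specific_flow(x):
    # Stand-in for a modality-specific encoder-decoder flow that predicts
    # a saliency map from one modality (illustrative: channel-mean proxy).
    return x.mean(axis=0, keepdims=True)

def modality_complementary_flow(s_rgb, s_thermal):
    # Stand-in for the Modality-aware Dynamic Aggregation Module
    # (illustrative: weight each modality-specific cue by a softmax
    # over its global response, then blend).
    w = np.exp([s_rgb.mean(), s_thermal.mean()])
    w = w / w.sum()
    return w[0] * s_rgb + w[1] * s_thermal

def flow_cooperative_fusion(s_rgb, s_thermal, s_comp):
    # Stand-in for the flow-cooperative fusion strategy: refine the
    # complementary prediction with both modality-specific maps
    # (illustrative weighted average; the actual fusion is learned).
    return (s_rgb + s_thermal + 2.0 * s_comp) / 4.0

# Aligned RGB-thermal pair in (C, H, W) layout (illustrative sizes).
rgb = np.random.rand(3, 64, 64)
thermal = np.random.rand(1, 64, 64)

s_rgb = modality_specific_flow(rgb)
s_thermal = modality_specific_flow(thermal)
s_comp = modality_complementary_flow(s_rgb, s_thermal)
s_final = flow_cooperative_fusion(s_rgb, s_thermal, s_comp)
print(s_final.shape)  # full-resolution saliency map: (1, 64, 64)
```

The sketch only conveys the dataflow: two parallel modality-specific predictions feed a complementary aggregation step, and all three are fused into a single full-resolution saliency map.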