This paper delves into the task of arbitrary modality salient object detection (AM SOD), which aims to detect salient objects from arbitrary modalities, e.g., RGB images, RGB-D images, and RGB-D-T images. A novel modality-adaptive Transformer (MAT) is proposed to address two fundamental challenges of AM SOD, i.e., the more diverse modality discrepancies caused by the varying modality types that need to be processed, and the dynamic fusion design required by the uncertain number of modalities present in the inputs of the multimodal fusion strategy. Specifically, inspired by prompt learning's ability to align the distributions of pre-trained models with the characteristics of downstream tasks by learning a set of prompts, MAT first presents a modality-adaptive feature extractor (MAFE) that tackles the diverse modality discrepancies by introducing a modality prompt for each modality. In the training stage, a new modality translation contractive (MTC) loss is further designed to assist MAFE in learning modality-distinguishable modality prompts. Accordingly, in the testing stage, MAFE can employ the learned modality prompts to adaptively adjust its feature space according to the characteristics of the input modalities, and is thus able to extract discriminative unimodal features. Then, MAFE presents a channel-wise and spatial-wise fusion hybrid (CSFH) strategy to meet the demand for dynamic fusion. To that end, CSFH dedicates a channel-wise dynamic fusion module (CDFM) and a novel spatial-wise dynamic fusion module (SDFM) to fusing the unimodal features from varying numbers of modalities while effectively capturing cross-modal complementary semantic and detail information, respectively. Moreover, CSFH carefully aligns CDFM and SDFM with different levels of unimodal features based on their characteristics, for more effective exploitation of complementary information.
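To make the dynamic fusion requirement concrete, the following is a minimal sketch of one way to fuse features from an uncertain number of modalities channel by channel. It is not the CDFM described above: the softmax weighting over globally pooled channel descriptors is an assumption chosen for simplicity, and the names `channel_descriptor`, `softmax`, and `fuse_channelwise` are hypothetical. Each feature is represented as a plain list of channels, with every channel holding its flattened spatial values.

```python
import math
from typing import List

# feature[c][i]: value of channel c at flattened spatial position i
Feature = List[List[float]]


def channel_descriptor(feature: Feature) -> List[float]:
    """Global average pooling: one scalar summary per channel."""
    return [sum(channel) / len(channel) for channel in feature]


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax over a non-empty list."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_channelwise(features: List[Feature]) -> Feature:
    """Fuse one or more same-shaped unimodal features (illustrative only).

    For every channel, the weights over modalities are a softmax of the
    pooled descriptors, so they sum to 1 however many modalities are given.
    This weighting rule is an assumption, not the method from the paper.
    """
    if not features:
        raise ValueError("at least one modality is required")
    descriptors = [channel_descriptor(f) for f in features]
    fused: Feature = []
    for c in range(len(features[0])):
        weights = softmax([d[c] for d in descriptors])
        positions = len(features[0][c])
        fused.append([
            sum(w * f[c][i] for w, f in zip(weights, features))
            for i in range(positions)
        ])
    return fused


# Toy features with 2 channels and 2 spatial positions each.
rgb = [[1.0, 3.0], [0.0, 0.0]]
depth = [[2.0, 2.0], [4.0, 0.0]]
thermal = [[0.0, 4.0], [1.0, 1.0]]

fused_rgb = fuse_channelwise([rgb])                   # one modality
fused_rgbd = fuse_channelwise([rgb, depth])           # two modalities
fused_rgbdt = fuse_channelwise([rgb, depth, thermal]) # three modalities
```

The same function handles one, two, or three inputs without any change to its structure, which is the property a dynamic fusion design needs. With a single modality the softmax weight is exactly 1, so the input feature passes through unchanged.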