Depth and thermal information can complement conventional RGB images for detecting salient objects. However, in dual-modal salient object detection (SOD) models, robustness against noisy inputs and missing modalities is crucial yet rarely studied. To tackle this problem, we introduce the \textbf{Co}nditional Dropout and \textbf{LA}nguage-driven (\textbf{CoLA}) framework, which comprises two core components. 1) Language-driven Quality Assessment (LQA): leveraging a pretrained vision-language model with a prompt learner, LQA recalibrates the contribution of each image without requiring additional quality annotations, effectively mitigating the impact of noisy inputs. 2) Conditional Dropout (CD): a learning method that strengthens the model's adaptability to missing-modality scenarios while preserving its performance when all modalities are available. CD serves as a plug-in training scheme that treats modality-missing as a condition, improving the overall robustness of various dual-modal SOD models. Extensive experiments demonstrate that the proposed method outperforms state-of-the-art dual-modal SOD models under both modality-complete and modality-missing conditions. We will release the source code upon acceptance.
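To make the Conditional Dropout idea concrete, the following is a minimal sketch of a plug-in training-time routine that randomly zeroes out one modality and emits a presence vector the model can condition on. The function name, the one-hot presence encoding, and the symmetric drop probabilities are assumptions for illustration; the abstract only states that modality-missing is treated as a condition during training.

```python
import numpy as np

def conditional_dropout(rgb, aux, p_drop=0.3, rng=None):
    """Sketch of Conditional Dropout (CD) for dual-modal training.

    With probability p_drop, zero out exactly one modality (RGB or the
    auxiliary depth/thermal input) and record which modalities remain in
    a condition vector. The downstream SOD model is assumed to take this
    vector as an extra conditioning input (an assumption, not spelled
    out in the abstract).
    """
    rng = rng or np.random.default_rng()
    cond = np.array([1.0, 1.0])  # [rgb present, aux present]
    r = rng.random()
    if r < p_drop / 2:           # drop the auxiliary (depth/thermal) modality
        aux = np.zeros_like(aux)
        cond[1] = 0.0
    elif r < p_drop:             # drop the RGB modality
        rgb = np.zeros_like(rgb)
        cond[0] = 0.0
    return rgb, aux, cond
```

At inference time, a genuinely missing modality would be replaced by the same zero tensor and presence flag, so the train-time conditions match the test-time ones.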