Although most existing multi-modal salient object detection (SOD) methods demonstrate effectiveness by training models from scratch, the limited multi-modal data hinders these methods from reaching optimal performance. In this paper, we propose a novel framework to explore and exploit the powerful feature representation and zero-shot generalization ability of the pre-trained Segment Anything Model (SAM) for multi-modal SOD. Although SAM serves as a recent vision foundation model, driving the class-agnostic SAM to comprehend and detect salient objects accurately is non-trivial, especially in challenging scenes. To this end, we develop \underline{SAM} with se\underline{m}antic f\underline{e}ature fu\underline{s}ion guidanc\underline{e} (Sammese), which incorporates multi-modal saliency-specific knowledge into SAM to adapt it to multi-modal SOD tasks. However, SAM, trained on single-modal data, can hardly mine the complementary benefits of multi-modal inputs directly or exploit them comprehensively for accurate saliency prediction. To address these issues, we first design a multi-modal complementary fusion module that extracts robust multi-modal semantic features by integrating information from visible and thermal or depth image pairs. We then feed the extracted multi-modal semantic features into both the SAM image encoder and mask decoder for fine-tuning and prompting, respectively. Specifically, in the image encoder, a multi-modal adapter is proposed to adapt the single-modal SAM to multi-modal information. In the mask decoder, a semantic-geometric prompt generation strategy is proposed to produce corresponding embeddings with various saliency cues. Extensive experiments on both RGB-D and RGB-T SOD benchmarks demonstrate the effectiveness of the proposed framework. The code will be available at \url{https://github.com/Angknpng/Sammese}.
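To make the described pipeline concrete, below is a minimal PyTorch-style sketch of the three components outlined above (complementary fusion, an encoder adapter, and prompt generation). The module names, layer choices, and dimensions are illustrative assumptions for exposition only, not the authors' actual implementation.
\begin{verbatim}
# Hypothetical sketch of the Sammese pipeline; names and layers are placeholders.
import torch
import torch.nn as nn

class ComplementaryFusion(nn.Module):
    """Fuses visible and thermal/depth features into multi-modal semantic features."""
    def __init__(self, dim):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * dim, dim, 1), nn.GELU(),
            nn.Conv2d(dim, dim, 3, padding=1))

    def forward(self, feat_rgb, feat_aux):
        # Concatenate the two modalities along channels and fuse them.
        return self.fuse(torch.cat([feat_rgb, feat_aux], dim=1))

class MultiModalAdapter(nn.Module):
    """Lightweight adapter injected into a (frozen) SAM image-encoder block."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, tokens, semantic_tokens):
        # tokens: (B, N, C) encoder tokens; semantic_tokens: (B, N, C) fused cues.
        return tokens + self.up(self.act(self.down(tokens + semantic_tokens)))

class PromptGenerator(nn.Module):
    """Maps multi-modal semantic features to sparse/dense prompt embeddings
    that can be fed to the SAM mask decoder."""
    def __init__(self, dim, num_tokens=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.to_sparse = nn.Linear(dim, num_tokens * dim)
        self.to_dense = nn.Conv2d(dim, dim, 1)
        self.num_tokens, self.dim = num_tokens, dim

    def forward(self, semantic_feat):
        b = semantic_feat.size(0)
        sparse = self.to_sparse(self.pool(semantic_feat).flatten(1))
        sparse = sparse.view(b, self.num_tokens, self.dim)   # token-like prompts
        dense = self.to_dense(semantic_feat)                  # map-like prompts
        return sparse, dense

if __name__ == "__main__":
    rgb_feat = torch.randn(2, 256, 64, 64)   # features from the visible image
    aux_feat = torch.randn(2, 256, 64, 64)   # features from the thermal/depth image
    fused = ComplementaryFusion(256)(rgb_feat, aux_feat)
    sparse, dense = PromptGenerator(256)(fused)
    print(sparse.shape, dense.shape)         # (2, 4, 256), (2, 256, 64, 64)
\end{verbatim}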