RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation

The Segment Anything Model (SAM) has gained significant attention for its impressive performance in image segmentation. However, it performs poorly on referring video object segmentation (RVOS) because it requires precise user-interactive prompts and has a limited understanding of different modalities, such as language and vision. This paper presents the RefSAM model, which explores the potential of SAM for RVOS by incorporating multi-view information from diverse modalities and successive frames at different timestamps in an online manner. Our approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP that projects the text embedding of the referring expression into sparse and dense embeddings, which serve as user-interactive prompts. Additionally, we introduce a hierarchical dense attention module that fuses hierarchical visual semantic information with the sparse embeddings to obtain fine-grained dense embeddings, and an implicit tracking module that generates a track token and provides historical information for the mask decoder. Furthermore, we employ a parameter-efficient tuning strategy to effectively align and fuse the language and vision features. Through comprehensive ablation studies, we demonstrate that the design choices of our model are practical and effective. Extensive experiments conducted on Ref-YouTube-VOS, Ref-DAVIS17, and three referring image segmentation datasets validate the superiority and effectiveness of our RefSAM model over existing methods. The code and models will be made publicly available at \href{https://github.com/LancasterLi/RefSAM}{github.com/LancasterLi/RefSAM}.
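
To make the prompt-generation idea concrete, the following is a minimal sketch of how a lightweight MLP could project text features into sparse and dense prompt embeddings. It is an illustration under stated assumptions, not the RefSAM implementation: the class `CrossModalMLP`, the helper `text_to_prompts`, the two-layer design, the use of word-level features for sparse prompts and a sentence-level feature tiled over the spatial grid for dense prompts, and all dimensions are hypothetical choices made only for this example.

```python
import random


def init_linear(in_dim, out_dim, rng):
    """Create a small random weight matrix (out_dim x in_dim) and a zero bias."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def linear(x, weight, bias):
    """Apply y = W x + b to one feature vector given as a list of floats."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    """Element-wise ReLU."""
    return [max(0.0, v) for v in x]


class CrossModalMLP:
    """Hypothetical two-layer MLP mapping text features into the prompt space."""

    def __init__(self, text_dim, prompt_dim, seed=0):
        rng = random.Random(seed)
        self.fc1 = init_linear(text_dim, prompt_dim, rng)
        self.fc2 = init_linear(prompt_dim, prompt_dim, rng)

    def __call__(self, x):
        hidden = relu(linear(x, *self.fc1))
        return linear(hidden, *self.fc2)


def text_to_prompts(word_feats, sentence_feat, mlp, grid_h, grid_w):
    """Turn text features into (sparse, dense) prompt embeddings.

    Assumption for this sketch: each word feature becomes one sparse prompt
    token, and the projected sentence feature is tiled over a grid_h x grid_w
    spatial grid to form the dense prompt.
    """
    sparse = [mlp(w) for w in word_feats]
    projected_sentence = mlp(sentence_feat)
    dense = [[list(projected_sentence) for _ in range(grid_w)] for _ in range(grid_h)]
    return sparse, dense
```

In this sketch the sparse output plays the role of point- or box-like prompt tokens and the dense output plays the role of a mask-like prompt; how the actual model forms and refines these embeddings is described in the method section of the paper rather than here.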
