Referring video object segmentation (RVOS) relies on natural language expressions to segment target objects in video, which places emphasis on modeling dense text-video relations. Current RVOS methods typically use independently pre-trained vision and language models as backbones, resulting in a significant domain gap between video and text. In cross-modal feature interaction, text features are used only for query initialization, so important information in the text is not fully exploited. In this work, we propose using frozen pre-trained vision-language models (VLMs) as backbones, with a specific emphasis on enhancing cross-modal feature interaction. First, we use a frozen convolutional CLIP backbone to generate feature-aligned vision and text features, alleviating the domain gap and reducing training costs. Second, we add more cross-modal feature fusion to the pipeline to make better use of multi-modal information. Furthermore, we propose a novel video query initialization method to generate higher-quality video queries. Without bells and whistles, our method achieved 51.5 J&F on the MeViS test set and ranked 3rd in the MeViS Track of the CVPR 2024 PVUW workshop: Motion Expression guided Video Segmentation.
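To make the idea of cross-modal feature fusion more concrete, the following is a minimal sketch of one common way to fuse text information into video features: residual cross-attention, in which each video feature attends over the text token features. This is an illustration under stated assumptions, not the authors' implementation. The function name `fuse_with_text`, the toy feature values, and the absence of learned projection matrices are all assumptions made for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_with_text(video_feats, text_feats):
    """Hypothetical residual cross-attention (no learned weights).

    video_feats: list of per-location video feature vectors (the queries).
    text_feats:  list of per-token text feature vectors (keys and values).
    Returns video features enriched with attended text information.
    """
    dim = len(text_feats[0])
    scale = 1.0 / math.sqrt(dim)  # standard scaled dot-product attention
    fused = []
    for v in video_feats:
        # Attention weights of this video feature over all text tokens.
        weights = softmax([dot(v, t) * scale for t in text_feats])
        # Weighted sum of text features.
        attended = [
            sum(w * t[i] for w, t in zip(weights, text_feats))
            for i in range(dim)
        ]
        # Residual connection keeps the original visual information.
        fused.append([vi + ai for vi, ai in zip(v, attended)])
    return fused


# Toy example: two video features and three text tokens, all of dimension 2.
video_feats = [[1.0, 0.0], [0.0, 1.0]]
text_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = fuse_with_text(video_feats, text_feats)
```

Because the vision and text features of a CLIP-style backbone already live in an aligned space, a dot product between them is a meaningful similarity, which is what makes this kind of attention usable on top of a frozen backbone. A real model would add learned query, key, and value projections and multiple attention heads.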