Referring Video Object Segmentation (RVOS) aims to segment the object referred to by a query sentence throughout an entire video. Most existing methods require end-to-end training with dense mask annotations, which is computationally expensive and limits scalability. In this work, we aim to efficiently adapt foundation segmentation models to address RVOS under weak supervision with the proposed Grounded Prompting (GroPrompt) framework. More specifically, we propose Text-Aware Prompt Contrastive Learning (TAP-CL) to enhance the association between position prompts and referring sentences using only box supervision, comprising Text-Contrastive Prompt Learning (TextCon) and Modality-Contrastive Prompt Learning (ModalCon) at the frame level and video level, respectively. With the proposed TAP-CL, our GroPrompt framework can generate temporally consistent yet text-aware position prompts that describe the location and movement of the referred object throughout the video. Experimental results on standard RVOS benchmarks (Ref-YouTube-VOS, Ref-DAVIS17, A2D-Sentences, and JHMDB-Sentences) demonstrate the competitive performance of the proposed GroPrompt framework given only bounding-box supervision.
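As a concrete illustration of the frame-level text-contrastive idea, and not the paper's exact formulation, TextCon can be sketched as a standard InfoNCE-style objective that pulls each position-prompt embedding toward its paired sentence embedding and pushes it away from the other sentences in the batch. The symbols below are illustrative assumptions: $p_i$ denotes the prompt embedding for sample $i$, $t_j$ the embedding of sentence $j$, $\mathrm{sim}(\cdot,\cdot)$ cosine similarity, $\tau$ a temperature, and $N$ the batch size.

$$
\mathcal{L}_{\text{TextCon}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\big(\mathrm{sim}(p_i, t_i)/\tau\big)}{\sum_{j=1}^{N} \exp\big(\mathrm{sim}(p_i, t_j)/\tau\big)}
$$

Under this reading, the video-level ModalCon objective would take an analogous form with video-aggregated prompt embeddings in place of per-frame ones, so that the contrast also enforces temporal consistency across frames.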