Referring video object segmentation (RVOS) aims to segment a target object in every frame of a video given a sentence describing that object. Previous RVOS methods have achieved strong performance, but they rely on densely annotated datasets, which are expensive and time-consuming to construct. To relieve the annotation burden while retaining sufficient supervision for segmentation, we propose a new annotation scheme in which the frame where the object first appears is labeled with a mask and the subsequent frames are labeled with bounding boxes. Based on this scheme, we propose a method that learns from this weak annotation. Specifically, we design a cross-frame segmentation method that uses language-guided dynamic filters to make full use of both the mask annotation and the bounding boxes. We further develop a bi-level contrastive learning method that encourages the model to learn discriminative representations at the pixel level. Extensive experiments and ablation analyses show that our method achieves competitive performance without requiring dense mask annotation. The code will be available at https://github.com/wangbo-zhao/WRVOS/.
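The annotation scheme described above can be illustrated with a minimal sketch. It is not taken from the released code: the function names `mask_to_box` and `to_weak_annotation`, the nested-list mask format, and the choice to store `None` for frames where the object is absent are all assumptions made for illustration. Given dense per-frame masks, it keeps the mask only for the first frame in which the object appears and reduces every later frame to a tight bounding box.

```python
from typing import List, Optional, Tuple, Union

Mask = List[List[int]]            # binary mask as rows of 0/1 (assumed format)
Box = Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max), inclusive


def mask_to_box(mask: Mask) -> Optional[Box]:
    """Return the tight bounding box of the foreground pixels, or None if empty."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not ys:
        return None
    xs = [x for row in mask for x, v in enumerate(row) if v]
    return (min(xs), ys[0], max(xs), ys[-1])


def to_weak_annotation(dense_masks: List[Mask]) -> List[Union[Mask, Box, None]]:
    """Keep the mask for the first frame containing the object, boxes afterwards."""
    weak: List[Union[Mask, Box, None]] = []
    seen = False
    for mask in dense_masks:
        box = mask_to_box(mask)
        if box is None:
            weak.append(None)   # object absent: nothing to label (assumption)
        elif not seen:
            weak.append(mask)   # first appearance: keep the full mask
            seen = True
        else:
            weak.append(box)    # later frames: bounding box only
    return weak


# Toy three-frame clip: the object is absent, then appears, then moves.
frames = [
    [[0, 0, 0], [0, 0, 0]],
    [[0, 1, 0], [0, 1, 1]],
    [[1, 1, 0], [0, 0, 0]],
]
weak = to_weak_annotation(frames)
```

In this toy clip, `weak[0]` is `None`, `weak[1]` is the full mask of the second frame, and `weak[2]` is the box `(0, 0, 1, 0)`.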