Referring video object segmentation (RVOS), as a supervised learning task, relies on sufficient annotated data for a given scene. In more realistic scenarios, however, only minimal annotations are available for a new scene, which poses significant challenges to existing RVOS methods. With this in mind, we propose a simple yet effective model with a newly designed cross-modal affinity (CMA) module based on a Transformer architecture. The CMA module builds multimodal affinity from a few samples, which lets the model quickly learn new semantic information and adapt to different scenarios. Since the proposed method targets limited samples for new scenes, we generalize the problem as few-shot referring video object segmentation (FS-RVOS). To foster research in this direction, we build a new FS-RVOS benchmark based on currently available datasets. The benchmark covers a wide range of situations so as to simulate real-world scenarios as closely as possible. Extensive experiments show that our model adapts well to different scenarios with only a few samples, reaching state-of-the-art performance on the benchmark. On Mini-Ref-YouTube-VOS, our model achieves an average performance of 53.1 J and 54.8 F, both 10% better than the baselines. Furthermore, we show impressive results of 77.7 J and 74.8 F on Mini-Ref-SAIL-VOS, which are significantly better than the baselines. Code is publicly available at https://github.com/hengliusky/Few_shot_RVOS.