This paper studies referring video object segmentation (RVOS) by boosting video-level visual-linguistic alignment. Recent approaches model the RVOS task as a sequence prediction problem and perform multi-modal interaction and segmentation for each frame separately. However, without a global view of the video content, these approaches struggle to exploit inter-frame relationships and to understand textual descriptions of how objects change over time. To address this issue, we propose Semantic-assisted Object Cluster (SOC), which aggregates video content and textual guidance for unified temporal modeling and cross-modal alignment. By associating a group of frame-level object embeddings with language tokens, SOC facilitates joint space learning across modalities and time steps. Moreover, we present multi-modal contrastive supervision to help construct a well-aligned joint space at the video level. We conduct extensive experiments on popular RVOS benchmarks, and our method outperforms state-of-the-art competitors on all of them by a remarkable margin. In addition, the emphasis on temporal coherence improves the segmentation stability of our method and its adaptability to text expressions that describe temporal variations. Code will be made available.
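
To make the idea of video-level contrastive supervision more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that frame-level object embeddings and language tokens are each mean-pooled into a single vector and that a symmetric InfoNCE-style loss is applied over a batch of video-text pairs. The function names, the pooling choice, and the `temperature` value are all hypothetical; the abstract does not specify them.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along `axis`."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def video_text_contrastive_loss(frame_obj_emb, text_tokens, temperature=0.07):
    """Symmetric InfoNCE-style loss between pooled video and text embeddings.

    frame_obj_emb: array of shape (B, T, D), one object embedding per frame.
    text_tokens:   array of shape (B, L, D), one embedding per language token.
    Assumption: mean pooling stands in for whatever aggregation SOC uses.
    """
    video_emb = l2_normalize(frame_obj_emb.mean(axis=1))  # (B, D)
    text_emb = l2_normalize(text_tokens.mean(axis=1))     # (B, D)

    # Cosine similarity between every video and every expression in the batch.
    logits = video_emb @ text_emb.T / temperature         # (B, B)

    # Matching pairs sit on the diagonal; all other entries act as negatives.
    idx = np.arange(logits.shape[0])
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_v2t + loss_t2v)
```

Under these assumptions, a batch in which each video is paired with its own expression should yield a lower loss than the same batch with the expressions shuffled, which is the behavior a video-level alignment objective is meant to encourage.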