Referring Video Object Segmentation (R-VOS) methods struggle to maintain consistent object segmentation due to temporal context variability and the presence of other visually similar objects. We propose an end-to-end R-VOS paradigm that explicitly models temporal instance consistency alongside referring segmentation. Specifically, we introduce a novel hybrid memory that facilitates inter-frame collaboration for robust spatio-temporal matching and propagation. Features from frames with automatically generated high-quality reference masks are propagated, via multi-granularity association, to segment the remaining frames, yielding temporally consistent R-VOS. Furthermore, we propose a new Mask Consistency Score (MCS) metric to evaluate the temporal consistency of video segmentation. Extensive experiments demonstrate that our approach enhances temporal consistency by a significant margin, leading to top-ranked performance on popular R-VOS benchmarks, i.e., Ref-YouTube-VOS (67.1%) and Ref-DAVIS17 (65.6%). The code is available at https://github.com/bo-miao/HTR.