Referring Video Object Segmentation (RVOS) is a challenging task due to its requirement for temporal understanding. Because of computational constraints, many state-of-the-art models are trained on short clips. At test time, while these models can effectively process information over short temporal windows, they struggle to maintain consistent perception over prolonged sequences, leading to inconsistencies in the resulting segmentation masks. To address this challenge, we leverage the tracking capability of the newly introduced Segment Anything Model version 2 (SAM-v2) to enhance the temporal consistency of the referring object segmentation model. Our method achieves a $\mathcal{J}\&\mathcal{F}$ score of 60.40 on the test set of the MeViS dataset, placing 2nd in the final ranking of the RVOS Track at the ECCV 2024 LSVOS Challenge.