This paper studies motion expression guided video segmentation, which segments objects in a video according to a sentence describing their motion. Existing referring video object datasets typically focus on salient objects and use language expressions that contain many static attributes, which can allow the target object to be identified from a single frame. As a result, these datasets downplay the importance of motion in video content for language-guided video object segmentation. To investigate the feasibility of using motion expressions to ground and segment objects in videos, we propose a large-scale dataset called MeViS, which contains numerous motion expressions that indicate target objects in complex environments. We benchmark 5 existing referring video object segmentation (RVOS) methods and conduct a comprehensive comparison on the MeViS dataset. The results show that current RVOS methods cannot effectively address motion expression guided video segmentation. We further analyze the challenges and propose a baseline approach for the MeViS dataset. The goal of our benchmark is to provide a platform for developing effective language-guided video segmentation algorithms that use motion expressions as a primary cue for object segmentation in complex video scenes. The MeViS dataset has been released at https://henghuiding.github.io/MeViS.