MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions

This paper studies motion expression-guided video segmentation: segmenting objects in a video according to a sentence that describes their motion. Existing referring video object datasets typically focus on salient objects and use language expressions with so many static attributes that the target object can potentially be identified from a single frame. As a result, these datasets downplay the importance of motion in language-guided video object segmentation. To investigate the feasibility of using motion expressions to ground and segment objects in videos, we propose MeViS, a large-scale dataset containing numerous motion expressions that indicate target objects in complex environments. We benchmark 5 existing referring video object segmentation (RVOS) methods and conduct a comprehensive comparison on the MeViS dataset. The results show that current RVOS methods cannot effectively address motion expression-guided video segmentation. We further analyze the challenges and propose a baseline approach for MeViS. The goal of the benchmark is to provide a platform for developing effective language-guided video segmentation algorithms that use motion expressions as a primary cue for object segmentation in complex video scenes. The MeViS dataset has been released at https://henghuiding.github.io/MeViS.

