Despite significant efforts, cutting-edge video segmentation methods remain sensitive to occlusion and rapid movement because they rely on object appearance, encoded as object embeddings, which is vulnerable to these disturbances. A common solution is to use optical flow to provide motion information, but optical flow considers only pixel-level motion, which itself relies on appearance similarity and is therefore often inaccurate under occlusion and fast movement. In this work, we study instance-level motion and present InstMove, which stands for Instance Motion for Object-centric Video Segmentation. In contrast to pixel-wise motion, InstMove relies mainly on instance-level motion information that is free from image feature embeddings and has a physical interpretation, making it more accurate and more robust to occlusion and fast-moving objects. To better fit video segmentation tasks, InstMove uses instance masks to model the physical presence of an object and learns a dynamic model through a memory network to predict the object's position and shape in the next frame. With only a few lines of code, InstMove can be integrated into current SOTA methods for three different video segmentation tasks and boost their performance. Specifically, we improve on prior art by 1.5 AP on the OVIS dataset, which features heavy occlusions, and by 4.9 AP on the YouTubeVIS-Long dataset, which mainly contains fast-moving objects. These results suggest that instance-level motion is robust and accurate, and hence serves as a powerful solution in complex scenarios for object-centric video segmentation.
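To make the idea of instance-level motion concrete, the following is a minimal sketch and not the InstMove implementation. The abstract describes a dynamic model learned through a memory network; here that model is replaced by a plain constant-velocity assumption, which is named as a stand-in. All function names (`centroid`, `predict_next_mask`, `mask_iou`, `match_by_motion`, `square`) are hypothetical, and masks are assumed to be non-empty sets of `(row, col)` pixel coordinates. The sketch predicts an instance's next-frame mask from its two previous masks alone, without any image features, and then selects the candidate mask that overlaps the prediction most.

```python
def centroid(mask):
    """Mean (row, col) of a non-empty mask given as a set of (row, col) pixels."""
    n = len(mask)
    return (sum(r for r, _ in mask) / n, sum(c for _, c in mask) / n)


def predict_next_mask(prev_mask, curr_mask):
    """Constant-velocity stand-in for a learned motion model (assumption):
    shift the current mask by the centroid displacement between the two frames."""
    (pr, pc), (cr, cc) = centroid(prev_mask), centroid(curr_mask)
    dr, dc = round(cr - pr), round(cc - pc)
    return {(r + dr, c + dc) for r, c in curr_mask}


def mask_iou(a, b):
    """Intersection over union of two pixel-set masks."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def match_by_motion(predicted, candidates):
    """Index of the candidate mask with the highest IoU against the prediction."""
    return max(range(len(candidates)),
               key=lambda i: mask_iou(predicted, candidates[i]))


def square(top, left, size=3):
    """Helper for the toy example: a filled size x size square mask."""
    return {(top + r, left + c) for r in range(size) for c in range(size)}


# Toy example: an object moving two pixels to the right per frame.
prev_mask = square(0, 0)
curr_mask = square(0, 2)
predicted = predict_next_mask(prev_mask, curr_mask)

# Candidate masks in the next frame; only the second continues the motion.
candidates = [square(0, 0), square(0, 4), square(5, 5)]
best = match_by_motion(predicted, candidates)
```

Because the prediction uses only mask geometry, it does not depend on appearance similarity, which is the property the abstract attributes to instance-level motion. A learned model would additionally predict shape changes, which this rigid shift cannot capture.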