Conventional video matting outputs a single alpha matte for all instances in a video frame, so individual instances are not distinguished. Video instance segmentation provides temporally consistent instance masks, but its results are unsatisfactory for matting applications, especially because of the applied binarization. To remedy this deficiency, we propose Video Instance Matting~(VIM), the task of estimating an alpha matte for each instance in each frame of a video sequence. To tackle this challenging problem, we present MSG-VIM, a Mask Sequence Guided Video Instance Matting neural network, as a novel baseline model for VIM. MSG-VIM leverages a mixture of mask augmentations to make predictions robust to inaccurate and inconsistent mask guidance, and it incorporates temporal mask and temporal feature guidance to improve the temporal consistency of alpha matte predictions. Furthermore, we build a new benchmark for VIM, called VIM50, which comprises 50 video clips with multiple human instances as foreground objects. To evaluate performance on the VIM task, we introduce a suitable metric called Video Instance-aware Matting Quality~(VIMQ). MSG-VIM sets a strong baseline on the VIM50 benchmark and outperforms existing methods by a large margin. The project is open-sourced at https://github.com/SHI-Labs/VIM.