Conventional video matting outputs a single alpha matte for all instances appearing in a video frame, so individual instances are not distinguished. Video instance segmentation provides temporally consistent instance masks, but its results are unsatisfactory for matting applications, especially because the masks are binarized. To remedy this deficiency, we propose Video Instance Matting~(VIM), the task of estimating the alpha matte of each instance at each frame of a video sequence. To tackle this challenging problem, we present MSG-VIM, a Mask Sequence Guided Video Instance Matting neural network, as a novel baseline model for VIM. MSG-VIM leverages a mixture of mask augmentations to make predictions robust to inaccurate and inconsistent mask guidance. It incorporates temporal mask and temporal feature guidance to improve the temporal consistency of alpha matte predictions. Furthermore, we build a new benchmark for VIM, called VIM50, which comprises 50 video clips with multiple human instances as foreground objects. To evaluate performance on the VIM task, we introduce a suitable metric called Video Instance-aware Matting Quality~(VIMQ). Our proposed model MSG-VIM sets a strong baseline on the VIM50 benchmark and outperforms existing methods by a large margin. The project is open-sourced at https://github.com/SHI-Labs/VIM.