Video portrait segmentation (VPS), which aims to segment prominent foreground portraits from video frames, has received much attention in recent years. However, the simplicity of existing VPS datasets limits extensive research on the task. In this work, we propose MVPS, a new, intricate, large-scale Multi-scene Video Portrait Segmentation dataset consisting of 101 video clips in 7 scenario categories, in which 10,843 sampled frames are finely annotated at the pixel level. The dataset has diverse scenes and complicated background environments and is, to the best of our knowledge, the most complex dataset in VPS. By observing a large number of portrait videos during dataset construction, we find that, because of the jointed structure of the human body, portrait motion is part-associated: different body parts move relatively independently of one another. That is, motion is imbalanced across the parts of a portrait. To address this imbalance, an intuitive and reasonable idea is to decouple portraits into parts so that their different motion states can be better exploited. To this end, we propose a Part-Decoupling Network (PDNet) for video portrait segmentation. Specifically, we propose an Inter-frame Part-Discriminated Attention (IPDA) module that segments the portrait into parts in an unsupervised manner and applies different attention to the discriminative features specific to each part. In this way, appropriate attention can be imposed on portrait parts with imbalanced motion to extract part-discriminated correlations, so that portraits can be segmented more accurately. Experimental results demonstrate that our method achieves leading performance compared with state-of-the-art methods.
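The general idea of part-wise inter-frame attention can be illustrated with a toy sketch. This is not the IPDA module itself, whose design the abstract does not specify; it only assumes one plausible reading: pixels receive soft, label-free part memberships, and each part runs its own attention from the current frame to a reference frame. The function names, the fixed `prototypes` (which a real model would presumably learn), and the log-membership bias are all hypothetical choices made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def soft_parts(feats, prototypes):
    """Soft membership of each pixel feature in K parts.

    Assumption: parts are defined by similarity to K prototype vectors,
    so no part labels are needed (a stand-in for unsupervised parsing).
    """
    return [softmax([dot(f, p) for p in prototypes]) for f in feats]


def part_attention(cur, ref, prototypes):
    """Toy part-wise attention from current-frame pixels to reference-frame pixels.

    For each part k, attention logits over reference pixels are biased by the
    log of their membership in part k, so each part attends mostly to itself.
    The per-part results are mixed by the query pixel's own part memberships.
    """
    cur_parts = soft_parts(cur, prototypes)  # N x K
    ref_parts = soft_parts(ref, prototypes)  # M x K
    dim = len(cur[0])
    out = []
    for q, q_parts in zip(cur, cur_parts):
        agg = [0.0] * dim
        for k in range(len(prototypes)):
            logits = [
                dot(q, r) / math.sqrt(dim) + math.log(rp[k] + 1e-8)
                for r, rp in zip(ref, ref_parts)
            ]
            weights = softmax(logits)
            for r, w in zip(ref, weights):
                for d in range(dim):
                    agg[d] += q_parts[k] * w * r[d]
        out.append(agg)
    return out


# Hypothetical 2-D features: two current-frame pixels, three reference pixels.
cur = [[1.0, 0.0], [0.0, 1.0]]
ref = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
prototypes = [[1.0, 0.0], [0.0, 1.0]]
out = part_attention(cur, ref, prototypes)
```

Because both the part memberships and the attention weights sum to one, each output vector is a convex combination of the reference features, which makes the sketch easy to sanity-check.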