Towards Unbalanced Motion: Part-Decoupling Network for Video Portrait Segmentation

Video portrait segmentation (VPS), which aims to segment prominent foreground portraits from video frames, has received much attention in recent years. However, the simplicity of existing VPS datasets limits extensive research on the task. In this work, we propose MVPS, a new, intricate, large-scale Multi-scene Video Portrait Segmentation dataset consisting of 101 video clips in 7 scenario categories, in which 10,843 sampled frames are finely annotated at the pixel level. The dataset covers diverse scenes and complicated background environments and is, to the best of our knowledge, the most complex dataset in VPS. By observing a large number of portrait videos during dataset construction, we find that, because of the jointed structure of the human body, portrait motion is part-associated: different body parts move relatively independently of one another. That is, the motion of different parts of a portrait is unbalanced. To address this imbalance, an intuitive and reasonable idea is to decouple portraits into parts so that their different motion states can be better exploited. To this end, we propose a Part-Decoupling Network (PDNet) for video portrait segmentation. Specifically, we propose an Inter-frame Part-Discriminated Attention (IPDA) module that segments the portrait into parts in an unsupervised manner and applies different attention to the discriminative features specific to each part. In this way, appropriate attention can be imposed on portrait parts with unbalanced motion to extract part-discriminated correlations, so that portraits can be segmented more accurately. Experimental results demonstrate that our method achieves leading performance compared with state-of-the-art methods.
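
To make the idea of part-decoupled inter-frame attention more concrete, the following is a minimal illustrative sketch in plain Python. It is not the authors' IPDA implementation: the abstract does not specify how parts are discovered or how attention is computed, so every detail here is an assumption. In particular, the part prototypes, the soft assignment by softmax over feature-prototype similarity, the log-membership bias on the attention logits, and all function names (`soft_part_assignment`, `part_decoupled_attention`) are hypothetical choices made only for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def soft_part_assignment(features, prototypes):
    """Assumed stand-in for unsupervised part discovery: each pixel feature
    gets a probability distribution over the K part prototypes."""
    return [softmax([dot(f, p) for p in prototypes]) for f in features]


def part_decoupled_attention(cur_feats, ref_feats, prototypes):
    """For every pixel of the current frame, attend to the reference frame
    once per part, then mix the per-part contexts by the pixel's own part
    membership. All design choices here are illustrative assumptions."""
    cur_parts = soft_part_assignment(cur_feats, prototypes)
    ref_parts = soft_part_assignment(ref_feats, prototypes)
    dim = len(cur_feats[0])
    scale = 1.0 / math.sqrt(dim)
    fused_feats = []
    for query, query_parts in zip(cur_feats, cur_parts):
        fused = [0.0] * dim
        for k in range(len(prototypes)):
            # Bias the logits toward reference pixels that belong to part k.
            logits = [
                scale * dot(query, ref) + math.log(membership[k] + 1e-8)
                for ref, membership in zip(ref_feats, ref_parts)
            ]
            weights = softmax(logits)
            # Part-specific context: weighted average of reference features.
            context = [
                sum(w * ref[d] for w, ref in zip(weights, ref_feats))
                for d in range(dim)
            ]
            for d in range(dim):
                fused[d] += query_parts[k] * context[d]
        fused_feats.append(fused)
    return fused_feats


# Toy data: 2 current-frame pixels, 3 reference-frame pixels, 2 parts.
demo_cur = [[1.0, 0.0], [0.0, 1.0]]
demo_ref = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
demo_prototypes = [[1.0, 0.0], [0.0, 1.0]]
demo_out = part_decoupled_attention(demo_cur, demo_ref, demo_prototypes)
```

In this sketch each part gets its own attention distribution over the reference frame, so a fast-moving part (for example, an arm) and a nearly static part (for example, the torso) are matched across frames separately rather than through one shared attention map. A real model would learn the prototypes and features end to end and operate on tensors rather than Python lists.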
