Human body parsing remains a challenging problem in natural scenes due to multi-instance and inter-part semantic confusion as well as occlusion. This paper proposes a novel approach to decomposing multiple human bodies into semantic part regions in unconstrained environments. Specifically, we propose a convolutional neural network (CNN) architecture that comprises novel semantic and contour attention mechanisms across the feature hierarchy to resolve the semantic ambiguities and boundary localization issues related to semantic body parsing. We further propose to encode the estimated pose as higher-level contextual information, which is combined with local semantic cues in a novel graphical model in a principled manner. In this model, the lower-level semantic cues can be recursively updated by propagating higher-level contextual information from the estimated pose, and vice versa, across the graph, so as to mitigate errors in both the pose estimates and the pixel-level predictions. We further propose an optimization technique to derive the solutions efficiently. Our proposed method achieves state-of-the-art results on the challenging Pascal Person-Part dataset.