Video-based human pose estimation (VHPE) is a vital yet challenging task. Although deep learning methods have made significant progress on VHPE, most approaches model the long-range interaction between joints only implicitly, by enlarging the receptive field of the convolution. In contrast, we design a lightweight, plug-and-play joint relation extractor (JRE) that models the associative relationship between joints explicitly and automatically. The JRE takes the pseudo heatmaps of joints as input and calculates the similarity between them. In this way, it flexibly learns the relationship between any two joints and thereby captures the rich spatial configuration of human poses. Moreover, the JRE can infer invisible joints from their relationships with other joints, which helps the model locate occluded joints. Combining the JRE with temporal semantic continuity modeling, we propose a Relation-based Pose Semantics Transfer Network (RPSTN) for video-based human pose estimation. Specifically, to capture the temporal dynamics of poses, a joint relation guided pose semantics propagator (JRPSP) transfers the pose semantic information of the current frame to the next frame. The proposed model can thus transfer pose semantic features from non-occluded frames to occluded frames, making our method robust to occlusion. Furthermore, the JRE module is also suitable for image-based human pose estimation. The proposed RPSTN achieves state-of-the-art results on the video-based Penn Action, Sub-JHMDB, and PoseTrack2018 datasets, and the JRE improves the performance of backbones on the image-based COCO2017 dataset. Code is available at https://github.com/YHDang/pose-estimation.
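To make the idea of relating joints through heatmap similarity concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not state which similarity measure the JRE uses, so cosine similarity is an assumption here, and the names `gaussian_heatmap`, `cosine_similarity`, and `joint_relation_matrix` are hypothetical. Synthetic Gaussian blobs stand in for the pseudo heatmaps that a backbone would produce.

```python
import math
from typing import List, Sequence

Heatmap = Sequence[Sequence[float]]  # H rows, each with W values


def gaussian_heatmap(height: int, width: int, cx: float, cy: float,
                     sigma: float = 2.0) -> List[List[float]]:
    """Synthetic stand-in for a pseudo heatmap: a Gaussian blob at (cx, cy)."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def cosine_similarity(a: Heatmap, b: Heatmap, eps: float = 1e-8) -> float:
    """Cosine similarity between two flattened heatmaps (assumed measure)."""
    flat_a = [v for row in a for v in row]
    flat_b = [v for row in b for v in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("heatmaps must have the same shape")
    dot = sum(x * y for x, y in zip(flat_a, flat_b))
    norm_a = math.sqrt(sum(x * x for x in flat_a))
    norm_b = math.sqrt(sum(y * y for y in flat_b))
    # eps guards against division by zero for an all-zero heatmap.
    return dot / max(norm_a * norm_b, eps)


def joint_relation_matrix(heatmaps: Sequence[Heatmap]) -> List[List[float]]:
    """K x K matrix whose entry (i, j) relates joint i to joint j."""
    return [[cosine_similarity(a, b) for b in heatmaps] for a in heatmaps]


# Three hypothetical joints on a 32 x 32 grid: two close together, one far away.
heatmaps = [
    gaussian_heatmap(32, 32, cx=8, cy=8),
    gaussian_heatmap(32, 32, cx=10, cy=8),
    gaussian_heatmap(32, 32, cx=24, cy=24),
]
relation = joint_relation_matrix(heatmaps)
```

In this toy setup the two nearby joints receive a much larger relation score than the distant pair. A learned module would presumably operate on network features and be trained end to end, which this sketch does not attempt.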