Graph convolutional networks (GCNs) have been applied to model the physically connected and non-local relations among human joints for 3D human pose estimation (HPE). In addition, purely Transformer-based models have recently shown promising results in video-based 3D HPE. However, single-frame methods still need to model the physically connected relations among joints, because feature representations transformed only by global relations via the Transformer neglect information on the human skeleton. To address this problem, we propose AMPose, a novel method in which Transformer encoder and GCN blocks are alternately stacked to combine the global and physically connected relations among joints for HPE. In AMPose, the Transformer encoder connects each joint with all other joints, while the GCNs capture information on physically connected relations. We evaluate the effectiveness of the proposed method on the Human3.6M dataset. The model also shows better generalization ability when tested on the MPI-INF-3DHP dataset. Code is available at https://github.com/erikervalid/AMPose.
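The alternating design described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation (see the linked repository for that): the 5-joint chain skeleton, the feature width, the depth, the random weights, and all function names are assumptions made for illustration, and layer normalization, multi-head attention, and feed-forward sublayers are omitted. Each layer first applies single-head self-attention, so every joint sees all other joints, and then a graph convolution over a normalized skeleton adjacency, so every joint aggregates only its physical neighbours.

```python
import math
import random

# Hypothetical 5-joint chain skeleton (illustration only, not the Human3.6M skeleton).
NUM_JOINTS = 5
EDGES = [(0, 1), (1, 2), (2, 3), (3, 4)]


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_adjacency(num_joints, edges):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 of the skeleton graph."""
    a = [[1.0 if i == j else 0.0 for j in range(num_joints)] for i in range(num_joints)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_joints)]
            for i in range(num_joints)]


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def attention_block(x, wq, wk, wv):
    """Single-head self-attention with a residual: global relations among joints."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))
    weights = [softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k]) for qi in q]
    out = matmul(weights, v)
    return [[xi + oi for xi, oi in zip(xr, orow)] for xr, orow in zip(x, out)]


def gcn_block(x, a_hat, w):
    """Graph convolution with ReLU and a residual: physically connected relations."""
    h = matmul(matmul(a_hat, x), w)
    return [[xi + max(0.0, hi) for xi, hi in zip(xr, hr)] for xr, hr in zip(x, h)]


def alternating_forward(x, a_hat, layers):
    """Apply attention and GCN blocks in alternation, one pair per layer."""
    for wq, wk, wv, wg in layers:
        x = attention_block(x, wq, wk, wv)
        x = gcn_block(x, a_hat, wg)
    return x


def random_matrix(rows, cols, rng):
    """Small random weights; a trained model would learn these instead."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim, depth = 8, 3  # assumed feature width and number of alternating layers
a_hat = normalized_adjacency(NUM_JOINTS, EDGES)
layers = [tuple(random_matrix(dim, dim, rng) for _ in range(4)) for _ in range(depth)]
features = random_matrix(NUM_JOINTS, dim, rng)  # stand-in for embedded 2D joint inputs
out = alternating_forward(features, a_hat, layers)
print(len(out), len(out[0]))  # one feature vector per joint: prints "5 8"
```

In this sketch the attention weights are dense over all joint pairs, whereas the normalized adjacency is zero for joints that are not physically connected, which is the contrast between the two block types that the abstract relies on.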