Graph convolutional networks (GCNs) have been applied to model the physically connected and non-local relations among human joints for 3D human pose estimation (HPE). In addition, purely Transformer-based models have recently shown promising results in video-based 3D HPE. However, single-frame methods still need to model the physically connected relations among joints, because feature representations transformed only by the Transformer's global relations neglect information about the human skeleton. To address this problem, we propose AMPose, a novel method in which Transformer encoder and GCN blocks are alternately stacked to combine the global and physically connected relations among joints for HPE. In AMPose, the Transformer encoder connects each joint with all the other joints, while the GCNs capture information on physically connected relations. We evaluate the effectiveness of the proposed method on the Human3.6M dataset. Our model also shows better generalization ability when tested on the MPI-INF-3DHP dataset. Code is available at https://github.com/erikervalid/AMPose.
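The alternating design described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that). It assumes a hypothetical 5-joint toy skeleton, single-head attention, random weights, and residual connections, and it omits layer normalization, feed-forward sublayers, the input embedding, and the 3D regression head. All function names here are illustrative.

```python
import numpy as np

# Hypothetical 5-joint toy skeleton (an assumption, not the Human3.6M layout):
# 0 = pelvis, 1 = spine, 2 = head, 3 = left hand, 4 = right hand.
BONES = [(0, 1), (1, 2), (1, 3), (1, 4)]
NUM_JOINTS = 5


def normalized_adjacency(bones, num_joints):
    """Adjacency with self-loops, normalized as D^-1/2 (A + I) D^-1/2."""
    a = np.eye(num_joints)
    for i, j in bones:
        a[i, j] = a[j, i] = 1.0
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def global_attention(x, wq, wk, wv):
    """Single-head self-attention: each joint attends to all joints."""
    q, k, v = x @ wq, x @ wk, x @ wv
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return x + weights @ v  # residual connection (assumed)


def graph_conv(x, a_hat, w):
    """One graph convolution: each joint mixes only with bone neighbors."""
    return x + np.maximum(a_hat @ x @ w, 0.0)  # ReLU plus residual (assumed)


def alternating_stack(x, a_hat, layers):
    """Alternate a global (attention) step and a local (graph) step."""
    for wq, wk, wv, wg in layers:
        x = global_attention(x, wq, wk, wv)
        x = graph_conv(x, a_hat, wg)
    return x


def init_layers(rng, dim, depth):
    """Random weights for `depth` attention + graph-conv pairs."""
    return [
        tuple(rng.normal(scale=0.1, size=(dim, dim)) for _ in range(4))
        for _ in range(depth)
    ]


rng = np.random.default_rng(0)
dim = 8
a_hat = normalized_adjacency(BONES, NUM_JOINTS)
layers = init_layers(rng, dim, depth=2)
features = rng.normal(size=(NUM_JOINTS, dim))  # stand-in for embedded 2D joints
out = alternating_stack(features, a_hat, layers)
print(out.shape)  # (5, 8)
```

In this sketch the attention step lets the head exchange information with the hands directly, while the graph step only propagates along bones, since `a_hat[2, 3]` is zero. Alternating the two is one way to realize the combination of global and physically connected relations that the abstract describes.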