Recent multi-frame lifting methods have dominated 3D human pose estimation. However, previous methods ignore the intricate dependencies within the 2D pose sequence and learn only a single temporal correlation. To alleviate this limitation, we propose TCPFormer, which leverages an implicit pose proxy as an intermediate representation. Each proxy within the implicit pose proxy can build one temporal correlation, thereby helping us learn a more comprehensive temporal correlation of human motion. Specifically, our method consists of three key components: the Proxy Update Module (PUM), the Proxy Invocation Module (PIM), and the Proxy Attention Module (PAM). PUM first uses pose features to update the implicit pose proxy, enabling it to store representative information from the pose sequence. PIM then invokes the pose proxy and integrates it with the pose sequence to enhance the motion semantics of each pose. Finally, PAM leverages this mapping between the pose sequence and the pose proxy to enhance the temporal correlation of the whole pose sequence. Experiments on the Human3.6M and MPI-INF-3DHP datasets demonstrate that TCPFormer outperforms previous state-of-the-art methods.
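The abstract does not specify how the three modules are implemented, so the following is only a minimal sketch of one plausible reading: each module is treated as plain single-head scaled dot-product attention between a pose sequence of `T` frames and a set of `P` proxies. The sizes `T`, `P`, `D`, the helper `cross_attention`, the residual updates, and the choice to form the temporal correlation as the product of the two attention maps are all assumptions made for illustration, not the authors' design. Learned projections, multiple heads, and normalization layers are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query, key, value):
    """Scaled dot-product attention (hypothetical helper).

    Returns the attended output and the row-normalized attention map.
    """
    d = query.shape[-1]
    attn = softmax(query @ key.T / np.sqrt(d), axis=-1)
    return attn @ value, attn


rng = np.random.default_rng(0)
T, P, D = 9, 4, 16  # frames, proxies, feature width: all assumed values
pose = rng.standard_normal((T, D))   # stand-in for per-frame pose features
proxy = rng.standard_normal((P, D))  # stand-in for the implicit pose proxy

# PUM-like step (assumption): the proxies query the pose sequence,
# so each proxy gathers information from all frames.
proxy_update, proxy_to_pose = cross_attention(proxy, pose, pose)  # (P, D), (P, T)
proxy = proxy + proxy_update

# PIM-like step (assumption): each pose queries the updated proxies
# and folds the retrieved information back into its own features.
pose_update, pose_to_proxy = cross_attention(pose, proxy, proxy)  # (T, D), (T, P)
pose = pose + pose_update

# PAM-like step (assumption): chain the two maps, pose -> proxy -> pose,
# to get a frame-to-frame correlation routed through the proxies.
temporal_corr = pose_to_proxy @ proxy_to_pose  # (T, T)
pose_out = temporal_corr @ pose                # (T, D)
```

Under this reading, each of the `P` proxies contributes one rank-one term to `temporal_corr`, which is one way to interpret the statement that each proxy builds one temporal correlation. Because both attention maps are row-normalized, every row of `temporal_corr` also sums to one.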