There has been a recent surge of interest in applying transformers to 3D human pose estimation (HPE), owing to their ability to model long-term dependencies. However, existing transformer-based methods treat all body joints as equally important inputs and ignore prior knowledge of the human skeleton topology in the self-attention mechanism. To tackle this issue, we propose a Pose-Oriented Transformer (POT) with uncertainty-guided refinement for 3D HPE. Specifically, we first develop a novel pose-oriented self-attention mechanism and a distance-related position embedding for POT to explicitly exploit the human skeleton topology. The pose-oriented self-attention mechanism models the topological interactions between body joints, whereas the distance-related position embedding encodes each joint's distance to the root joint, distinguishing groups of joints that differ in regression difficulty. Furthermore, we present an Uncertainty-Guided Refinement Network (UGRN) that refines the pose predictions from POT, especially for difficult joints, by taking the estimated uncertainty of each joint into account through an uncertainty-guided sampling strategy and a self-attention mechanism. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art methods with fewer model parameters on 3D HPE benchmarks such as Human3.6M and MPI-INF-3DHP.
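To make the idea of encoding each joint's distance to the root joint concrete, the following is a minimal sketch that computes hop distances over a skeleton tree and groups joints by that distance. It is an illustration only, not the authors' implementation: the 17-joint parent list `PARENTS` is an assumed layout commonly used with Human3.6M, the function names `hop_distances` and `group_by_distance` are hypothetical, and measuring distance in hops along the skeleton is one plausible reading of "distance to the root joint".

```python
from collections import deque

# ASSUMPTION: a 17-joint skeleton in a layout commonly used with Human3.6M.
# PARENTS[j] is the parent of joint j; -1 marks the root (pelvis).
# This layout is illustrative and is not taken from the abstract.
PARENTS = [-1, 0, 1, 2, 0, 4, 5, 0, 7, 8, 9, 8, 11, 12, 8, 14, 15]


def hop_distances(parents):
    """Return the number of skeleton edges between each joint and the root."""
    n = len(parents)
    children = [[] for _ in range(n)]
    root = None
    for joint, parent in enumerate(parents):
        if parent < 0:
            root = joint
        else:
            children[parent].append(joint)
    if root is None:
        raise ValueError("parents must contain exactly one root marked with -1")

    # Breadth-first traversal from the root assigns each joint its depth.
    dist = [-1] * n
    dist[root] = 0
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in children[current]:
            dist[child] = dist[current] + 1
            queue.append(child)
    return dist


def group_by_distance(dist):
    """Group joint indices by hop distance; each group could share one embedding."""
    groups = {}
    for joint, d in enumerate(dist):
        groups.setdefault(d, []).append(joint)
    return groups


if __name__ == "__main__":
    distances = hop_distances(PARENTS)
    for d, joints in sorted(group_by_distance(distances).items()):
        print(f"distance {d}: joints {joints}")
```

Under this assumed layout, the wrists fall in the farthest group and the hips and spine in the nearest, which matches the intuition that joints far from the root are harder to regress. In a model, the integer distance would typically index a learned embedding table whose vectors are added to the joint tokens; that step is omitted here because the abstract does not specify it.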