HDFormer: High-order Directed Transformer for 3D Human Pose Estimation

Human pose estimation is a challenging structured-data sequence modeling task. Most existing methods consider only pair-wise interactions between human body joints during model learning. As a result, 3D pose estimation fails in difficult cases such as $\textit{overlapping joints}$ and $\textit{fast-changing poses}$, because pair-wise relations cannot exploit fine-grained human body priors. To this end, we revamp the 3D pose estimation framework with a $\textit{High-order}$ $\textit{Directed}$ $\textit{Transformer}$ (HDFormer), which coherently exploits high-order relations between bones and joints to boost pose estimation performance. Specifically, HDFormer combines self-attention and high-order attention into a multi-order attention module that models information flow across first-order "$\textit{joint}\leftrightarrow\textit{joint}$", second-order "$\textit{bone}\leftrightarrow\textit{joint}$", and high-order "$\textit{hyperbone}\leftrightarrow\textit{joint}$" relationships (a hyperbone is defined as a set of joints), improving predictions in hard cases involving fast-changing poses and heavy occlusion. Moreover, modernized CNN techniques are applied to upgrade the transformer-based architecture and speed up HDFormer, achieving a favorable trade-off between effectiveness and efficiency. We compare our model with other state-of-the-art (SOTA) models on the Human3.6M and MPI-INF-3DHP datasets. The results demonstrate that HDFormer achieves superior performance with only $\textbf{1/10}$ of the parameters and much lower computational cost than current SOTA models. HDFormer can also be applied to various real-world applications, enabling real-time and accurate 3D pose estimation. The source code is available at https://github.com/hyer/HDFormer.
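To make the three interaction orders concrete, the following is a minimal NumPy sketch and not the HDFormer implementation. It rests on several assumptions that do not come from the abstract: a toy 5-joint chain skeleton, bone and hyperbone features obtained by mean-pooling their member joints, attention without learned projections, and a plain residual sum to fuse the three orders. The helper names `attend` and `pool` and the particular hyperbone sets are hypothetical, and the directed aspect of the model is not represented.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, tokens):
    """Scaled dot-product attention with joints as queries.

    Returns the attended features and the attention weights.
    No learned projections are used (an assumption for brevity).
    """
    d = queries.shape[-1]
    weights = softmax(queries @ tokens.T / np.sqrt(d))
    return weights @ tokens, weights

def pool(features, groups):
    """Mean-pool joint features over each group of joint indices (assumed pooling)."""
    return np.stack([features[list(g)].mean(axis=0) for g in groups])

# Toy 5-joint chain skeleton: parents[i] is the parent of joint i (-1 marks the root).
parents = [-1, 0, 1, 2, 3]
bones = [(i, p) for i, p in enumerate(parents) if p >= 0]  # each bone joins two joints
hyperbones = [(0, 1, 2), (2, 3, 4)]                        # hypothetical joint sets

rng = np.random.default_rng(0)
joints = rng.normal(size=(5, 8))  # (num_joints, feature_dim)

first, w1 = attend(joints, joints)                    # joint <-> joint
second, w2 = attend(joints, pool(joints, bones))      # bone <-> joint
high, w3 = attend(joints, pool(joints, hyperbones))   # hyperbone <-> joint

fused = joints + first + second + high  # assumed fusion: residual sum of all orders
print(fused.shape)  # (5, 8)
```

In this sketch every joint queries three token sets of decreasing size (5 joints, 4 bones, 2 hyperbones), so the higher-order terms add context from groups of joints at a lower attention cost than the joint-to-joint term.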
