Human pose estimation is a challenging task owing to the structured, sequential nature of its data. Existing methods focus primarily on pairwise interactions between body joints, which is insufficient for scenarios involving overlapping joints and rapidly changing poses. To overcome these issues, we introduce the High-order Directed Transformer (HDFormer), a novel approach that leverages high-order bone and joint relationships for improved pose estimation. Specifically, HDFormer combines self-attention and high-order attention into a multi-order attention module. This module facilitates first-order "joint$\leftrightarrow$joint", second-order "bone$\leftrightarrow$joint", and high-order "hyperbone$\leftrightarrow$joint" interactions, which helps in complex and occlusion-heavy situations. In addition, modern CNN techniques are integrated into the transformer-based architecture to balance the trade-off between performance and efficiency. HDFormer significantly outperforms state-of-the-art (SOTA) models on the Human3.6M and MPI-INF-3DHP datasets while requiring only 1/10 of the parameters and significantly lower computational cost. Moreover, HDFormer demonstrates broad real-world applicability, enabling real-time, accurate 3D pose estimation. The source code is available at https://github.com/hyer/HDFormer
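To make the three interaction orders concrete, the following is a minimal NumPy sketch and not the HDFormer implementation. The abstract does not specify how bone or hyperbone features are built or how the attention branches are fused, so everything here is an assumption for illustration: the 5-joint chain skeleton, the names `BONES`, `HYPERBONES`, `attend`, and `multi_order_attention`, the use of directed differences (child minus parent, path end minus path start) as bone and hyperbone features, and the simple averaging of the three branches.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(queries, keys_values):
    """Scaled dot-product attention. For brevity, keys and values share
    one array and no learned projections are used (an assumption)."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values


# Hypothetical 5-joint chain skeleton: directed (parent, child) pairs.
BONES = [(0, 1), (1, 2), (2, 3), (3, 4)]
# Hypothetical hyperbones: directed paths of two consecutive bones.
HYPERBONES = [(0, 1, 2), (1, 2, 3), (2, 3, 4)]


def multi_order_attention(joints):
    """joints: (J, D) array of joint features. Returns a (J, D) array."""
    # Second-order features: one directed vector per bone (assumption).
    bones = np.stack([joints[c] - joints[p] for p, c in BONES])
    # High-order features: path end minus path start, i.e. the sum of
    # the bone vectors along each hyperbone (assumption).
    hyper = np.stack([joints[path[-1]] - joints[path[0]] for path in HYPERBONES])

    first = attend(joints, joints)   # joint <-> joint
    second = attend(joints, bones)   # bone <-> joint
    high = attend(joints, hyper)     # hyperbone <-> joint

    # Residual connection plus an unweighted average of the branches.
    return joints + (first + second + high) / 3.0


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    features = rng.normal(size=(5, 8))  # 5 joints, 8-dimensional features
    print(multi_order_attention(features).shape)
```

In this sketch the joints always act as queries, so each joint gathers context from other joints, from bones, and from longer bone chains. A real model would add learned query, key, and value projections, multiple heads, and a learned fusion of the branches.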