Transformer-based methods have recently achieved significant success in 3D human pose estimation, owing to their strong ability to model long-range dependencies. However, relying solely on the global attention mechanism is insufficient for capturing fine-grained local details, which are crucial for accurate pose estimation. To address this, we propose SSR-STF, a dual-stream model that effectively integrates local features with global dependencies to enhance 3D human pose estimation. Specifically, we introduce SSRFormer, a simple yet effective module that employs a skeleton selective refine attention (SSRA) mechanism to capture fine-grained local dependencies in human pose sequences, complementing the global dependencies modeled by the Transformer. By adaptively fusing these two feature streams, SSR-STF can better learn the underlying structure of human poses, overcoming the limitations of traditional methods in local feature extraction. Extensive experiments on the Human3.6M and MPI-INF-3DHP datasets demonstrate that SSR-STF achieves state-of-the-art performance, with P1 errors of 37.4 mm and 13.2 mm respectively, outperforming existing methods in both accuracy and generalization. Furthermore, the motion representations learned by our model prove effective in downstream tasks such as human mesh recovery. Code is available at https://github.com/poker-xu/SSR-STF.
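The full SSRA design and fusion scheme are specified in the paper and repository; as a minimal illustrative sketch of the dual-stream idea only, the snippet below pairs a windowed (local) attention stream with full-sequence (global) attention and mixes them with a learned sigmoid gate. All names, shapes, and the gating form here are our own assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_attention(X):
    # Global stream: every frame attends to all T frames (Transformer-style).
    A = softmax(X @ X.T / np.sqrt(X.shape[-1]))
    return A @ X

def local_attention(X, window=3):
    # Local stream (stand-in for SSRA): each frame attends only to a
    # small temporal neighborhood, capturing fine-grained local detail.
    T, C = X.shape
    out = np.zeros_like(X)
    for t in range(T):
        lo, hi = max(0, t - window), min(T, t + window + 1)
        w = softmax(X[t] @ X[lo:hi].T / np.sqrt(C))
        out[t] = w @ X[lo:hi]
    return out

def adaptive_fuse(local, glob, Wg):
    # Adaptive fusion: a sigmoid gate computed from both streams decides,
    # per feature, how much of the local vs. global stream to keep.
    g = 1.0 / (1.0 + np.exp(-(np.concatenate([local, glob], axis=-1) @ Wg)))
    return g * local + (1.0 - g) * glob

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 8))           # 16 frames of 8-dim pose features (toy sizes)
Wg = rng.normal(size=(16, 8)) * 0.1    # gate projection: 2*C -> C
fused = adaptive_fuse(local_attention(X), global_attention(X), Wg)
print(fused.shape)  # (16, 8): same shape as the input sequence
```

In a real model the gate would be a trained layer and both streams would be multi-head attention blocks; the point of the sketch is only that the two streams see the same sequence and are recombined per-feature rather than simply concatenated.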