In recent years, 2D-to-3D pose uplifting in monocular 3D Human Pose Estimation (HPE) has attracted widespread research interest. GNN-based and Transformer-based methods have become the mainstream architectures owing to their strong spatial and temporal feature learning capacities. However, existing approaches typically construct joint-wise and frame-wise attention alignments in the spatial and temporal domains, resulting in dense connections that introduce considerable local redundancy and computational overhead. In this paper, we take a global approach to exploiting spatio-temporal information and realize efficient 3D HPE with a concise Graph and Skipped Transformer architecture. Specifically, in the Spatial Encoding stage, coarse-grained body parts are used to construct a Spatial Graph Network with a fully data-driven adaptive topology, ensuring model flexibility and generalizability across various poses. In the Temporal Encoding and Decoding stages, a simple yet effective Skipped Transformer is proposed to capture long-range temporal dependencies and implement hierarchical feature aggregation. A straightforward Data Rolling strategy is also developed to introduce dynamic information into the 2D pose sequence. Extensive experiments are conducted on the Human3.6M, MPI-INF-3DHP and HumanEva benchmarks. The G-SFormer series of methods achieves superior performance compared with previous state-of-the-art methods, with only around ten percent of the parameters and significantly reduced computational complexity. G-SFormer also exhibits outstanding robustness to inaccuracies in detected 2D poses.
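To make the idea of a part-level spatial graph with a data-driven topology more concrete, the following is a minimal NumPy sketch, not the G-SFormer implementation. It assumes a 17-joint skeleton, a hypothetical assignment of joints to five coarse body parts, mean pooling from joints to parts, and a free-form adjacency matrix (`adj_logits`) that would be learned during training but is random here. The names `BODY_PARTS`, `pool_parts` and `adaptive_graph_layer` are illustrative and do not come from the paper.

```python
import numpy as np

# Hypothetical grouping of 17 joints into 5 coarse body parts.
# The index assignment is an illustrative assumption, not the paper's definition.
BODY_PARTS = {
    "torso":     [0, 7, 8, 9, 10],
    "right_leg": [1, 2, 3],
    "left_leg":  [4, 5, 6],
    "left_arm":  [11, 12, 13],
    "right_arm": [14, 15, 16],
}

def pool_parts(joint_feats):
    """Average joint features of shape (J, C) into part features of shape (P, C)."""
    return np.stack([joint_feats[idx].mean(axis=0) for idx in BODY_PARTS.values()])

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_graph_layer(part_feats, adj_logits, weight):
    """One graph layer: row-normalised free-form adjacency, linear map, ReLU."""
    adj = softmax(adj_logits, axis=-1)   # (P, P), each row sums to 1
    return np.maximum(adj @ part_feats @ weight, 0.0)

rng = np.random.default_rng(0)
joints = rng.normal(size=(17, 2))        # one frame of 2D keypoints
parts = pool_parts(joints)               # (5, 2) part-level features
adj_logits = rng.normal(size=(5, 5))     # would be trained; random in this sketch
weight = rng.normal(size=(2, 16))        # would be trained; random in this sketch
out = adaptive_graph_layer(parts, adj_logits, weight)
print(out.shape)  # (5, 16)
```

Because the adjacency is not tied to the skeleton's bone connections, any part can exchange information with any other part, and working on 5 parts instead of 17 joints shrinks the adjacency from 17 x 17 to 5 x 5 entries. How the actual model pools joints and parameterises the topology is not specified in the abstract.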