Recent transformer-based approaches have demonstrated excellent performance in 3D human pose estimation. However, because they take a holistic view and encode global relationships between all joints, they do not capture local dependencies precisely. In this paper, we present a novel Attention-GCNFormer (AGFormer) block that splits the channels between two parallel streams, a transformer stream and a GCNFormer stream. Our proposed GCNFormer module exploits the local relationships between adjacent joints and outputs a new representation that is complementary to the transformer output. By fusing these two representations adaptively, AGFormer is better able to learn the underlying 3D structure. By stacking multiple AGFormer blocks, we propose MotionAGFormer in four variants, which can be chosen based on the desired speed-accuracy trade-off. We evaluate our model on two popular benchmark datasets, Human3.6M and MPI-INF-3DHP. MotionAGFormer-B achieves state-of-the-art results, with P1 errors of 38.4 mm and 16.2 mm, respectively. Remarkably, it uses a quarter of the parameters and is three times more computationally efficient than the previous leading model on the Human3.6M dataset. Code and models are available at https://github.com/TaatiTeam/MotionAGFormer.
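To make the idea of adaptive fusion concrete, the following is a minimal sketch of one plausible form it could take for a single joint: the two stream outputs are concatenated, projected to two scores, and the softmax of those scores weights the two representations. The abstract does not specify the fusion rule, so the function names `softmax` and `adaptive_fuse`, the parameters `weight` and `bias`, and the per-joint formulation are all assumptions made for illustration, not the released implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_fuse(attn_feat, gcn_feat, weight, bias):
    """Fuse two feature vectors of one joint with input-dependent weights.

    attn_feat, gcn_feat: lists of C floats (outputs of the two streams).
    weight: 2 rows of 2*C floats, bias: 2 floats (hypothetical learned
    parameters; in a real model these would be trained).
    """
    concat = attn_feat + gcn_feat  # list concatenation, length 2*C
    # One score per stream, computed from both representations.
    logits = [
        sum(w * x for w, x in zip(row, concat)) + b
        for row, b in zip(weight, bias)
    ]
    alpha_attn, alpha_gcn = softmax(logits)
    # Convex combination of the two stream outputs.
    return [alpha_attn * a + alpha_gcn * g for a, g in zip(attn_feat, gcn_feat)]


# With all-zero parameters both streams get weight 0.5 (a plain average).
zero_weight = [[0.0] * 4, [0.0] * 4]
fused = adaptive_fuse([1.0, 3.0], [3.0, 5.0], zero_weight, [0.0, 0.0])
print(fused)
```

Because the weights come from a softmax, the fused vector is always a convex combination of the two stream outputs; with trained parameters the balance could shift toward whichever stream is more informative for a given input.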