Dual-stream Transformer-GCN Model with Contextualized Representations Learning for Monocular 3D Human Pose Estimation

This paper introduces an approach to monocular 3D human pose estimation that combines contextualized representation learning with a Transformer-GCN dual-stream model. Monocular 3D human pose estimation is challenged by depth ambiguity, limited 3D-labeled training data, imbalanced modeling of global and local interactions, and restricted model generalization. To address these limitations, we propose a motion pre-training method based on contextualized representation learning. Specifically, the method masks 2D pose features and uses a Transformer-GCN dual-stream model to learn high-dimensional representations in a self-distillation setup. By focusing on contextualized representation learning and spatial-temporal modeling, the approach improves the model's ability to capture spatial-temporal relationships between poses, which in turn improves generalization. The dual-stream design also balances global and local interactions in video pose estimation: the model adaptively integrates information from both streams, with the GCN stream learning local relationships between adjacent keypoints and frames and the Transformer stream capturing global spatial and temporal features. The model achieves state-of-the-art performance on two benchmark datasets, with an MPJPE of 38.0 mm and a P-MPJPE of 31.9 mm on Human3.6M, and an MPJPE of 15.9 mm on MPI-INF-3DHP. Visual experiments on public datasets and in-the-wild videos further demonstrate the robustness and generalization of the approach.
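The abstract reports results as MPJPE and P-MPJPE without defining them. As a minimal sketch, assuming the definitions commonly used on Human3.6M (mean per-joint Euclidean error, and the same error after a per-frame similarity alignment of the prediction to the ground truth), the two metrics could be computed as below. The function names and the `(frames, joints, 3)` array layout are illustrative assumptions, not taken from the paper, and the authors' evaluation code may differ in details such as root-joint centering.

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean per-joint position error.

    pred, gt: arrays of shape (frames, joints, 3) in the same unit (e.g. mm).
    Returns the Euclidean joint error averaged over all frames and joints.
    """
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def procrustes_align(pred, gt):
    """Align one predicted pose (joints, 3) to its ground truth.

    Solves for the scale, rotation, and translation that minimise the
    squared error (Umeyama / Kabsch solution via SVD).
    """
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g

    # Cross-covariance between centered prediction and ground truth.
    h = p.T @ g
    u, s, vt = np.linalg.svd(h)

    # Flip the last axis if needed so the result is a proper rotation.
    d = np.sign(np.linalg.det(vt.T @ u.T))
    flip = np.diag([1.0, 1.0, d])
    rot = vt.T @ flip @ u.T

    scale = (s * np.diag(flip)).sum() / (p ** 2).sum()
    return scale * p @ rot.T + mu_g


def p_mpjpe(pred, gt):
    """Procrustes-aligned MPJPE: align every frame, then average the error."""
    aligned = np.stack([procrustes_align(p, g) for p, g in zip(pred, gt)])
    return mpjpe(aligned, gt)
```

Because P-MPJPE removes global scale, rotation, and translation before measuring error, it is never larger than MPJPE for the same prediction, which is consistent with the 31.9 mm versus 38.0 mm figures quoted for Human3.6M.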

