Capturing the motion of every joint: 3D human pose and shape estimation with independent tokens

In this paper we present a novel method to estimate 3D human pose and shape from monocular videos. This task requires directly recovering pixel-aligned 3D human pose and body shape from monocular images or videos, which is challenging due to its inherent ambiguity. To improve precision, existing methods rely heavily on an initialized mean pose and shape as prior estimates and regress parameters through iterative error feedback. In addition, video-based approaches model the overall change in image-level features to temporally enhance single-frame features, but they fail to capture rotational motion at the joint level and cannot guarantee local temporal consistency. To address these issues, we propose a novel Transformer-based model built on independent tokens. First, we introduce three types of tokens that are independent of the image feature: joint rotation tokens, a shape token, and a camera token. By progressively interacting with image features through Transformer layers, these tokens learn to encode prior knowledge of human 3D joint rotations, body shape, and position information from large-scale data, and are updated to estimate SMPL parameters conditioned on a given image. Second, benefiting from the proposed token-based representation, we further use a temporal model that focuses on capturing the rotational temporal information of each joint, which we find empirically helps prevent large jitters in local parts. Despite being conceptually simple, the proposed method attains superior performance on the 3DPW and Human3.6M datasets. Using ResNet-50 and Transformer architectures, it obtains 42.0 mm error on the PA-MPJPE metric on the challenging 3DPW dataset, outperforming state-of-the-art counterparts by a large margin. Code will be publicly available at https://github.com/yangsenius/INT_HMR_Model
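The token design described above can be illustrated with a small NumPy sketch. It is not the released implementation: the token width, the number of layers, the single-head attention, the 6D rotation output, the three-value camera, and all function names are assumptions made for illustration, and random matrices stand in for trained weights. It shows only the data flow: image-independent tokens are concatenated with per-frame image features, updated by attention layers, read out as SMPL-style parameters, and then each joint's token is refined along the time axis.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_JOINTS = 24   # SMPL pose: root orientation plus 23 body joints
DIM = 64          # token width (hypothetical)
ROT_DIM = 6       # 6D rotation per joint (assumption, not stated in the abstract)
SHAPE_DIM = 10    # SMPL shape coefficients
CAM_DIM = 3       # scale plus 2D translation (assumption)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, wq, wk, wv):
    """Single-head self-attention with a residual connection over (tokens, DIM)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    attn = softmax(q @ k.T / np.sqrt(x.shape[-1]))
    return x + attn @ v


def random_weights(*shape):
    """Random stand-in for a trained weight matrix."""
    return rng.normal(size=shape) * shape[0] ** -0.5


# Independent tokens: parameters that do not depend on the input image.
joint_tokens = rng.normal(size=(NUM_JOINTS, DIM)) * 0.02
shape_token = rng.normal(size=(1, DIM)) * 0.02
camera_token = rng.normal(size=(1, DIM)) * 0.02

spatial_layers = [tuple(random_weights(DIM, DIM) for _ in range(3)) for _ in range(2)]
temporal_layer = tuple(random_weights(DIM, DIM) for _ in range(3))
w_rot = random_weights(DIM, ROT_DIM)
w_shape = random_weights(DIM, SHAPE_DIM)
w_cam = random_weights(DIM, CAM_DIM)


def encode_frame(image_tokens):
    """Update the independent tokens against one frame's (N, DIM) image features."""
    x = np.concatenate([joint_tokens, shape_token, camera_token, image_tokens], axis=0)
    for wq, wk, wv in spatial_layers:
        x = self_attention(x, wq, wk, wv)
    return x[:NUM_JOINTS], x[NUM_JOINTS], x[NUM_JOINTS + 1]


def temporal_refine(joint_seq):
    """Attend over time separately for each joint; joint_seq is (T, NUM_JOINTS, DIM)."""
    per_joint = joint_seq.transpose(1, 0, 2)  # (NUM_JOINTS, T, DIM)
    refined = np.stack([self_attention(seq, *temporal_layer) for seq in per_joint])
    return refined.transpose(1, 0, 2)         # back to (T, NUM_JOINTS, DIM)


# Usage: 8 frames, each with a flattened 7x7 feature map of width DIM.
frames = rng.normal(size=(8, 49, DIM))
encoded = [encode_frame(f) for f in frames]
joint_seq = np.stack([joints for joints, _, _ in encoded])
rot6d = temporal_refine(joint_seq) @ w_rot               # per-frame, per-joint rotations
betas = np.stack([s for _, s, _ in encoded]) @ w_shape   # per-frame shape coefficients
cams = np.stack([c for _, _, c in encoded]) @ w_cam      # per-frame camera parameters
print(rot6d.shape, betas.shape, cams.shape)
```

Because every joint keeps its own token, the temporal step can operate on one joint's sequence at a time, which is the property the abstract credits with reducing jitter in local body parts.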
