We address the problem of regressing 3D human pose and shape from a single image, with a focus on 3D accuracy. The current best methods leverage large datasets of 3D pseudo-ground-truth (p-GT) and 2D keypoints, leading to robust performance. With such methods, we observe a paradoxical decline in 3D pose accuracy with increasing 2D accuracy. This is caused by biases in the p-GT and the use of an approximate camera projection model. We quantify the error induced by current camera models and show that fitting 2D keypoints and p-GT accurately causes incorrect 3D poses. Our analysis defines the invalid distances within which minimizing 2D and p-GT losses is detrimental. We use this to formulate a new loss, Threshold-Adaptive Loss Scaling (TALS), that penalizes gross 2D and p-GT losses but not smaller ones. With such a loss, there are many 3D poses that could equally explain the 2D evidence. To reduce this ambiguity, we need a prior over valid human poses, but such priors can introduce unwanted bias. To address this, we exploit a tokenized representation of human pose and reformulate the problem as token prediction. This restricts the estimated poses to the space of valid poses, effectively providing a uniform prior. Extensive experiments on the EMDB and 3DPW datasets show that our reformulated keypoint loss and tokenization allow us to train on in-the-wild data while improving 3D accuracy over the state of the art. Our models and code are available for research at https://tokenhmr.is.tue.mpg.de.
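The idea of penalizing gross 2D and p-GT losses but not smaller ones can be illustrated with a minimal sketch. The abstract does not give the TALS formula, so everything below is an assumption made for illustration: the function name `threshold_scaled_loss`, the use of absolute residuals, the `threshold` parameter (standing in for the invalid distance), and the `small_scale` factor applied to residuals inside the threshold. It should not be read as the released implementation.

```python
def threshold_scaled_loss(residuals, threshold, small_scale=0.0):
    """Hypothetical thresholded loss (illustrative, not the released TALS code).

    Residuals whose magnitude exceeds `threshold` are penalized in full.
    Residuals at or below `threshold` are down-weighted by `small_scale`
    (0.0 ignores them entirely).
    """
    if not residuals:
        return 0.0
    total = 0.0
    for r in residuals:
        err = abs(r)
        # Assumed rule: full weight for gross errors, reduced weight otherwise.
        weight = 1.0 if err > threshold else small_scale
        total += weight * err
    return total / len(residuals)


# Example: only the two residuals larger than 1.0 contribute to the loss.
loss = threshold_scaled_loss([0.5, 2.0, -3.0], threshold=1.0)
```

Under this assumed form, any prediction whose residuals all lie inside the threshold receives the same (zero or near-zero) loss, which is why many 3D poses can explain the 2D evidence equally well and why a prior over valid poses is needed.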