Monocular 3D human pose estimation (3D-HPE) is an inherently ambiguous task, since a single 2D pose in an image can originate from several different 3D poses. Yet most 3D-HPE methods rely on regression models, which assume a one-to-one mapping between inputs and outputs. In this work, we provide theoretical and empirical evidence that, because of this ambiguity, common regression models are bound to predict topologically inconsistent poses, and that traditional evaluation metrics such as MPJPE, P-MPJPE, and PCK are insufficient to assess this aspect. As a solution, we propose ManiPose, a novel manifold-constrained multi-hypothesis model that proposes multiple candidate 3D poses for each 2D input, together with their corresponding plausibility. Unlike previous multi-hypothesis approaches, our solution is fully supervised and does not rely on complex generative models, which greatly facilitates its training and usage. Furthermore, by constraining the model's predictions to lie on the human pose manifold, we can guarantee the consistency of all hypothesized poses, which was not possible in previous work. We illustrate the usefulness of ManiPose in a synthetic 1D-to-2D lifting setting and demonstrate on real-world datasets that it outperforms state-of-the-art models in pose consistency by a large margin, while still reaching competitive MPJPE performance.
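The ambiguity argument can be illustrated with a toy version of the 1D-to-2D lifting setting. The sketch below is not the ManiPose implementation: the unit-circle setup, the choice of two equally likely modes, and all names (`expected_error`, `oracle_error`, and so on) are assumptions made for illustration only. It compares a single regressed prediction (the conditional mean, which is what a squared-error regressor converges to) with predictions that stay on the circle.

```python
import numpy as np

# Toy 1D-to-2D lifting (illustrative assumption): a joint lies on the unit
# circle, i.e. the segment from the origin has length 1, but only its x
# coordinate is observed. Two 2D poses are then equally plausible.
x = 0.6
y = np.sqrt(1.0 - x**2)
modes = np.array([[x, y], [x, -y]])   # the two valid poses for this input
weights = np.array([0.5, 0.5])        # assumed equal plausibility


def expected_error(pred, targets, probs):
    """Expected Euclidean joint error of one prediction over all targets."""
    dists = np.linalg.norm(targets - pred, axis=-1)
    return float(probs @ dists)


def oracle_error(hypotheses, targets, probs):
    """Expected error when the closest hypothesis is picked for each target."""
    # Pairwise distances: rows are targets, columns are hypotheses.
    dists = np.linalg.norm(targets[:, None, :] - hypotheses[None, :, :], axis=-1)
    return float(probs @ dists.min(axis=1))


# A squared-error regressor converges to the conditional mean of the targets.
mean_pred = weights @ modes
segment_length_mean = float(np.linalg.norm(mean_pred))

err_mean = expected_error(mean_pred, modes, weights)
err_single_mode = expected_error(modes[0], modes, weights)
err_oracle = oracle_error(modes, modes, weights)

print(f"segment length of mean prediction: {segment_length_mean:.3f} (valid: 1.000)")
print(f"expected joint error, mean prediction:    {err_mean:.3f}")
print(f"expected joint error, one on-circle mode: {err_single_mode:.3f}")
print(f"oracle error with both on-circle modes:   {err_oracle:.3f}")
```

In this toy case the averaged prediction falls inside the circle, so its segment length is shorter than the true one, yet its expected joint error is no worse than that of a valid on-circle prediction. This is the kind of inconsistency that a position-error metric alone does not reveal, whereas proposing both on-circle hypotheses keeps every candidate valid.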