Recovering 3D human poses from a monocular camera view is a highly ill-posed problem because of depth ambiguity. Earlier methods that lift 2D poses to 3D often produce incorrect yet overconfident 3D estimates. To mitigate this problem, emerging probabilistic approaches treat the 3D estimate as a distribution, which accounts for the uncertainty of the predicted poses. In the same vein, we propose FMPose, a probabilistic 3D human pose estimation method based on the flow matching generative approach. Conditioned on the 2D cues, the flow matching scheme learns, via continuous normalizing flows, the optimal transport from a simple source distribution to the distribution of plausible 3D human poses. The 2D lifting condition is modeled with graph convolutional networks, which use learnable connections between human body joints as the graph structure for feature aggregation. Compared to diffusion-based methods, FMPose with optimal transport generates 3D poses faster and more accurately. Experimental results show major improvements of FMPose over current state-of-the-art methods on three common benchmarks for 3D human pose estimation, namely Human3.6M, MPI-INF-3DHP and 3DPW.
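The ingredients named in the abstract can be illustrated with a minimal NumPy sketch. It is not the FMPose implementation. The function names (`graph_conv`, `flow_matching_pair`, `euler_sample`), the straight-line probability path, the softmax-normalized adjacency, and the fixed-step Euler solver are all assumptions chosen for illustration. The learned velocity network and any optimal-transport coupling of source and target samples are omitted.

```python
import numpy as np


def graph_conv(joints_2d, adj_logits, weight):
    """One graph-convolution layer over body joints (hypothetical layer).

    joints_2d:  (J, 2) 2D joint coordinates used as the condition
    adj_logits: (J, J) learnable edge scores between joints
    weight:     (2, F) learnable feature projection
    """
    # Row-wise softmax turns the edge scores into aggregation weights.
    shifted = adj_logits - adj_logits.max(axis=1, keepdims=True)
    adj = np.exp(shifted)
    adj /= adj.sum(axis=1, keepdims=True)
    # Aggregate neighbor features, then project them to F channels.
    return adj @ joints_2d @ weight


def flow_matching_pair(pose_3d, rng):
    """Build one training example, assuming a straight-line path.

    Returns the time t, the interpolated point x_t, and the velocity
    that a network v(x_t, t, condition) would be regressed onto.
    """
    x0 = rng.standard_normal(pose_3d.shape)  # sample from the simple source
    t = rng.uniform()                        # time in [0, 1)
    x_t = (1.0 - t) * x0 + t * pose_3d       # point on the straight path
    target_velocity = pose_3d - x0           # constant velocity of that path
    return t, x_t, target_velocity


def euler_sample(velocity_fn, x0, cond, n_steps=10):
    """Integrate dx/dt = velocity_fn(x, t, cond) from t=0 to t=1 (Euler)."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t, cond)
    return x
```

In this sketch, training would minimize the squared error between a network's prediction at `(x_t, t)` and `target_velocity`. Drawing several source samples `x0` for the same 2D condition and passing each through `euler_sample` would give several 3D pose hypotheses, which is how a probabilistic estimator of this kind expresses uncertainty.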