FMPose3D: monocular 3D pose estimation via flow matching

Monocular 3D pose estimation is fundamentally ill-posed because of depth ambiguity and occlusions, which motivates probabilistic methods that generate multiple plausible 3D pose hypotheses. Diffusion-based models have recently demonstrated strong performance, but their iterative denoising process typically requires many timesteps per prediction, making inference computationally expensive. In contrast, we leverage Flow Matching (FM) to learn a velocity field that defines an Ordinary Differential Equation (ODE), enabling efficient generation of 3D pose samples with only a few integration steps. We propose a novel generative pose estimation framework, FMPose3D, that formulates 3D pose estimation as a conditional distribution transport problem: it continuously transports samples from a standard Gaussian prior to the distribution of plausible 3D poses conditioned only on 2D inputs. Although ODE trajectories are deterministic, FMPose3D naturally generates diverse pose hypotheses by starting from different noise samples. To obtain a single accurate prediction from those hypotheses, we further introduce a Reprojection-based Posterior Expectation Aggregation (RPEA) module, which approximates the Bayesian posterior expectation over 3D hypotheses. FMPose3D surpasses existing methods on the widely used human pose estimation benchmarks Human3.6M and MPI-INF-3DHP, and achieves state-of-the-art performance on the 3D animal pose datasets Animal3D and CtrlAni3D, demonstrating strong performance across both the human and animal 3D pose domains. The code is available at https://github.com/AdaptiveMotorControlLab/FMPose3D.
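The sampling and aggregation pipeline described above can be illustrated with a minimal sketch. This is not the authors' implementation (the repository linked above is the reference). The function names, the analytic `toy_velocity` field standing in for a trained network, the orthographic projection, and the softmax weighting with a `temperature` parameter are all assumptions made for illustration only. The sketch shows few-step Euler integration of an ODE from Gaussian noise to 3D poses conditioned on a 2D pose, followed by a reprojection-weighted average of the resulting hypotheses.

```python
import numpy as np

def toy_velocity(x, t, pose_2d):
    """Hypothetical stand-in for a learned velocity field v(x, t | 2D pose).

    It pulls the x/y coordinates toward the 2D observation along a straight
    path and leaves depth untouched, so depth is set by the initial noise.
    """
    target = x.copy()
    target[..., :2] = pose_2d           # broadcast (J, 2) over all hypotheses
    return (target - x) / (1.0 - t)     # velocity of a straight-line path

def sample_hypotheses(pose_2d, num_hypotheses=8, num_steps=4, seed=0,
                      velocity=toy_velocity):
    """Euler-integrate dx/dt = v(x, t) from t=0 to t=1, one run per noise sample."""
    rng = np.random.default_rng(seed)
    num_joints = pose_2d.shape[0]
    # Standard Gaussian prior: shape (hypotheses, joints, 3).
    x = rng.standard_normal((num_hypotheses, num_joints, 3))
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt                      # t stays below 1, so 1 - t >= dt
        x = x + dt * velocity(x, t, pose_2d)
    return x

def aggregate_by_reprojection(hypotheses, pose_2d, temperature=0.1):
    """Weighted mean of hypotheses; weights decrease with 2D reprojection error.

    Assumption: orthographic projection (drop depth) and softmax weights.
    The actual RPEA formulation may differ.
    """
    reprojected = hypotheses[..., :2]
    # Mean per-joint 2D error for each hypothesis: shape (H,).
    error = np.linalg.norm(reprojected - pose_2d, axis=-1).mean(axis=-1)
    logits = -error / temperature
    logits = logits - logits.max()      # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum()
    pose = np.einsum("h,hjc->jc", weights, hypotheses)
    return pose, weights

# Usage: 17 joints with made-up 2D coordinates.
pose_2d = np.linspace(-1.0, 1.0, 34).reshape(17, 2)
hypotheses = sample_hypotheses(pose_2d, num_hypotheses=8, num_steps=4)
final_pose, weights = aggregate_by_reprojection(hypotheses, pose_2d)
```

Each ODE trajectory is deterministic given its starting point, so the diversity across hypotheses comes entirely from the initial Gaussian draws. In this toy field that diversity shows up only in depth, which is the quantity a single 2D view leaves ambiguous.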
