Flexible Geometric Guidance for Probabilistic Human Pose Estimation with Diffusion Models

3D human pose estimation from 2D images is a challenging problem due to depth ambiguity and occlusion. These challenges make the task underdetermined: multiple (possibly infinitely many) poses are plausible given a single image. Despite this, many prior works assume a deterministic mapping and estimate a single pose per image. Furthermore, learning-based methods require large amounts of paired 2D-3D training data and generalize poorly to unseen scenarios. To address both issues, we propose a framework for pose estimation using diffusion models, which enables sampling from a probability distribution over plausible poses that are consistent with a 2D image. Our approach falls under the guidance framework for conditional generation: it guides samples from an unconditional diffusion model, trained only on 3D data, using the gradients of the heatmaps from a 2D keypoint detector. We evaluate our method on the Human3.6M dataset under best-of-$m$ multiple-hypothesis evaluation, showing state-of-the-art performance among methods that do not require paired 2D-3D data for training. We additionally evaluate generalization on the MPI-INF-3DHP and 3DPW datasets and demonstrate competitive performance. Finally, we demonstrate the flexibility of our framework by applying it to novel tasks, including pose generation and pose completion, without the need to train bespoke conditional models. We make code available at https://github.com/fsnelgar/diffusion_pose.
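As an illustration of the best-of-$m$ protocol mentioned above, the following minimal sketch scores a set of $m$ sampled hypotheses against a ground-truth pose and keeps the lowest error. It assumes the error is the mean per-joint position error (MPJPE), which the abstract does not name; the function names and the toy two-joint pose are hypothetical and are not taken from the released code.

```python
import math


def mpjpe(pred, target):
    """Mean per-joint position error: average Euclidean distance over joints.

    Assumption: the abstract does not name the metric; MPJPE is used here
    purely for illustration.
    """
    if len(pred) != len(target):
        raise ValueError("poses must have the same number of joints")
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(target)


def best_of_m_mpjpe(hypotheses, target):
    """Lowest MPJPE among the m sampled hypotheses for one image."""
    return min(mpjpe(h, target) for h in hypotheses)


# Toy example: a hypothetical 2-joint "skeleton" and m = 3 samples.
target = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
hypotheses = [
    [(0.0, 0.0, 3.0), (0.0, 1.0, 3.0)],  # every joint off by 3 in depth
    [(0.0, 0.0, 1.0), (0.0, 1.0, 1.0)],  # every joint off by 1 in depth
    [(4.0, 0.0, 0.0), (4.0, 1.0, 0.0)],  # every joint off by 4 sideways
]
print(best_of_m_mpjpe(hypotheses, target))  # 1.0
```

Because only the best hypothesis is scored, this kind of evaluation rewards a sampler whose distribution covers the true pose, which is why it suits probabilistic estimators that return several depth-ambiguous candidates.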
