FinePOSE: Fine-Grained Prompt-Driven 3D Human Pose Estimation via Diffusion Models

The 3D Human Pose Estimation (3D HPE) task uses 2D images or videos to predict human joint coordinates in 3D space. Despite recent advances, deep learning-based methods mostly overlook the possibility of coupling accessible texts with naturally feasible knowledge of humans, and thus miss valuable implicit supervision that could guide the 3D HPE task. Moreover, previous efforts often study the task from the perspective of the whole human body, neglecting the fine-grained guidance hidden in different body parts. To this end, we present a new Fine-Grained Prompt-Driven Denoiser based on a diffusion model for 3D HPE, named \textbf{FinePOSE}. It consists of three core blocks that enhance the reverse process of the diffusion model: (1) The Fine-grained Part-aware Prompt learning (FPP) block constructs fine-grained part-aware prompts by coupling accessible texts and naturally feasible knowledge of body parts with learnable prompts to model implicit guidance. (2) The Fine-grained Prompt-pose Communication (FPC) block establishes fine-grained communication between the learned part-aware prompts and poses to improve denoising quality. (3) The Prompt-driven Timestamp Stylization (PTS) block integrates the learned prompt embedding with temporal information related to the noise level to enable adaptive adjustment at each denoising step. Extensive experiments on public single-human pose estimation datasets show that FinePOSE outperforms state-of-the-art methods. We further extend FinePOSE to multi-human pose estimation, where it achieves an average MPJPE of 34.3 mm on the EgoHumans dataset, demonstrating its potential to handle complex multi-human scenarios. Code is available at https://github.com/PKU-ICST-MIPL/FinePOSE_CVPR2024.
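
The MPJPE figure quoted above is the mean per-joint position error, i.e. the Euclidean distance between predicted and ground-truth 3D joints averaged over all joints and frames. The following is a minimal sketch of that metric, not the authors' evaluation code: the function name `mpjpe`, the `(frames, joints, 3)` array layout, and the toy values are assumptions made for illustration, and the root-joint alignment that evaluation protocols commonly apply beforehand is omitted.

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean per-joint position error between two pose arrays.

    Both inputs are assumed to have shape (..., joints, 3) and to share
    the same unit (e.g. millimetres); the result is in that unit.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if pred.shape != gt.shape:
        raise ValueError(f"shape mismatch: {pred.shape} vs {gt.shape}")
    # Euclidean distance per joint, then the mean over joints and frames.
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


# Toy data (made up): 2 frames, 3 joints, coordinates in mm.
gt = np.zeros((2, 3, 3))
pred = gt.copy()
pred[..., 0] += 3.0  # shift every joint by 3 mm along x
pred[..., 1] += 4.0  # and by 4 mm along y

# Every joint is displaced by (3, 4, 0), so each per-joint error is 5 mm.
print(mpjpe(pred, gt))  # 5.0
```

Because the metric is a plain average of distances, a lower value means predicted joints lie closer to the ground truth on average; it says nothing about how the error is distributed across body parts.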
