Learning keypoint heatmaps with a neural network is one of the mainstream schemes for 2D human pose estimation (HPE). Existing methods typically improve heatmap quality through customized architectures, such as high-resolution representations and vision Transformers. In this paper, we propose \textbf{DiffusionPose}, a new scheme that formulates 2D HPE as the problem of generating keypoint heatmaps from noised heatmaps. During training, the keypoints are diffused to a random distribution by adding noise, and the diffusion model learns to recover the ground-truth heatmaps from the noised heatmaps under conditions constructed from image features. During inference, the diffusion model generates heatmaps from initialized heatmaps by progressive denoising. We further explore improving the performance of DiffusionPose with conditions derived from human structural information. Extensive experiments demonstrate the effectiveness of DiffusionPose, with improvements of 1.6, 1.2, and 1.2 mAP on the widely used COCO, CrowdPose, and AI Challenge datasets, respectively.
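The training-time noising step described above can be sketched as follows. This is a minimal illustration, assuming the standard DDPM forward process with a linear noise schedule; the abstract does not specify the schedule, the heatmap resolution, or how keypoints are rendered, so the helper names `gaussian_heatmap`, `make_alpha_bar`, and `q_sample` and all parameter values are hypothetical choices rather than the authors' implementation.

```python
import numpy as np

def gaussian_heatmap(height, width, cx, cy, sigma=2.0):
    """Render one keypoint at (cx, cy) as a 2D Gaussian heatmap with peak value 1."""
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative signal-retention factors for a linear beta schedule (an assumption)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Forward diffusion: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()

# Ground-truth heatmap for a single (hypothetical) keypoint on a 64x48 grid.
x0 = gaussian_heatmap(64, 48, cx=20, cy=30)

# Early timesteps keep most of the signal; the last timestep is almost pure noise.
x_early, _ = q_sample(x0, 10, alpha_bar, rng)
x_late, _ = q_sample(x0, 999, alpha_bar, rng)
```

In a setup like this, a denoising network would be trained to recover `x0` (or the noise `eps`) from `x_t`, the timestep `t`, and the image-feature conditions; at inference it would start from an initialized noisy heatmap and apply the learned denoiser step by step.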