DiffPose: SpatioTemporal Diffusion Model for Video-Based Human Pose Estimation

Denoising diffusion probabilistic models, initially proposed for realistic image generation, have recently shown success in various perception tasks (e.g., object detection and image segmentation) and are gaining increasing attention in computer vision. However, extending such models to multi-frame human pose estimation is non-trivial because videos add a temporal dimension. More importantly, learning representations that focus on keypoint regions is crucial for accurate localization of human joints, yet it remains unclear how diffusion-based methods can be adapted to achieve this objective. In this paper, we present DiffPose, a novel diffusion architecture that formulates video-based human pose estimation as a conditional heatmap generation problem. First, to better leverage temporal information, we propose a SpatioTemporal Representation Learner, which aggregates visual evidence across frames and uses the resulting features as a condition in each denoising step. In addition, we present a mechanism called Lookup-based MultiScale Feature Interaction, which determines the correlations between local joints and global contexts across multiple scales and thereby generates fine-grained representations that focus on keypoint regions. Altogether, by extending diffusion models, we show two unique characteristics of DiffPose on the pose estimation task: (i) the ability to combine multiple sets of pose estimates to improve prediction accuracy, particularly for challenging joints, and (ii) the ability to adjust the number of iterative steps for feature refinement without retraining the model. DiffPose sets new state-of-the-art results on three benchmarks: PoseTrack2017, PoseTrack2018, and PoseTrack21.
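
The two characteristics named in the abstract, combining several sets of pose estimates and changing the number of iterative steps without retraining, can be illustrated with a minimal sketch of conditional heatmap sampling. This is not the DiffPose implementation: the cosine noise schedule, the DDIM-style deterministic update, and every name below (`toy_denoiser`, `sample_heatmaps`, `ensemble_keypoints`, `gaussian_heatmap`) are assumptions made for illustration, and the toy denoiser is a hypothetical stand-in for a trained network conditioned on spatiotemporal features.

```python
import numpy as np


def cosine_alpha_bar(t, T):
    """Cumulative signal level for an (assumed) cosine noise schedule."""
    return np.cos((t / T) * np.pi / 2) ** 2


def gaussian_heatmap(h, w, cx, cy, sigma=1.0):
    """Gaussian bump with peak 1.0 at pixel (cx, cy) on an h x w grid."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))


def toy_denoiser(noisy, cond, t):
    """Hypothetical stand-in for the trained network.

    It treats the condition `cond` as a coarse heatmap estimate and adds a
    small bounded term that depends on the noisy input, so that different
    noise seeds give slightly different predictions. It ignores `t`.
    """
    return cond + 0.05 * np.tanh(noisy)


def sample_heatmaps(cond, denoiser, num_steps, T=1000, rng=None):
    """Generate heatmaps from Gaussian noise in `num_steps` denoising steps.

    `num_steps` is chosen at inference time; nothing is retrained.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = rng.standard_normal(cond.shape)       # start from pure noise
    ts = np.linspace(T, 0, num_steps + 1)     # timesteps from T down to 0
    for t_now, t_next in zip(ts[:-1], ts[1:]):
        a_now = cosine_alpha_bar(t_now, T)
        a_next = cosine_alpha_bar(t_next, T)
        x0 = denoiser(x, cond, t_now)         # predicted clean heatmaps
        # Noise implied by the current sample and the clean prediction.
        eps = (x - np.sqrt(a_now) * x0) / np.sqrt(max(1 - a_now, 1e-8))
        # Deterministic (DDIM-style) move to the next, less noisy level.
        x = np.sqrt(a_next) * x0 + np.sqrt(1 - a_next) * eps
    return x


def ensemble_keypoints(cond, denoiser, num_samples, num_steps, seed=0):
    """Average several sampled heatmap sets, then take per-joint argmax."""
    rng = np.random.default_rng(seed)
    samples = [
        sample_heatmaps(cond, denoiser, num_steps, rng=rng)
        for _ in range(num_samples)
    ]
    mean_hm = np.mean(samples, axis=0)        # shape (joints, H, W)
    joints, h, w = mean_hm.shape
    flat_idx = mean_hm.reshape(joints, -1).argmax(axis=1)
    ys, xs = np.unravel_index(flat_idx, (h, w))
    return np.stack([xs, ys], axis=1)         # (x, y) per joint


if __name__ == "__main__":
    # Two joints on a 16 x 16 grid; `cond` plays the role of the condition.
    cond = np.stack([gaussian_heatmap(16, 16, 4, 5),
                     gaussian_heatmap(16, 16, 10, 12)])
    for steps in (1, 4, 20):
        print(steps, ensemble_keypoints(cond, toy_denoiser, 5, steps).tolist())
```

In this sketch the step count and the number of averaged samples are plain inference-time arguments, which mirrors the flexibility the abstract describes; whether averaging heatmaps is how DiffPose combines its estimates is not stated in the abstract and is assumed here.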
