Disentangled Diffusion-Based 3D Human Pose Estimation with Hierarchical Spatial and Temporal Denoiser

Recently, diffusion-based methods for monocular 3D human pose estimation have achieved state-of-the-art (SOTA) performance by directly regressing 3D joint coordinates from a 2D pose sequence. Some methods instead decompose the task into bone length and bone direction prediction, following the human anatomical skeleton, to explicitly incorporate more human body prior constraints, but they perform significantly worse than the SOTA diffusion-based methods. This gap can be attributed to the tree structure of the human skeleton: when the disentangled formulation is applied directly, an error at one level of the hierarchy propagates to every level below it, so hierarchical errors accumulate. Meanwhile, previous methods have not fully explored the hierarchical information. To address these problems, we propose DDHPose, a Disentangled Diffusion-based 3D Human Pose Estimation method with a Hierarchical Spatial and Temporal Denoiser. In our approach: (1) We disentangle the 3D pose and diffuse the bone length and bone direction during the forward process of the diffusion model to effectively model the human pose prior. A disentanglement loss is proposed to supervise the learning of the diffusion model. (2) For the reverse process, we propose the Hierarchical Spatial and Temporal Denoiser (HSTDenoiser) to improve the hierarchical modeling of each joint. HSTDenoiser comprises two components: the Hierarchical-Related Spatial Transformer (HRST) and the Hierarchical-Related Temporal Transformer (HRTT). HRST exploits joint spatial information and the influence of the parent joint on each joint for spatial modeling, while HRTT uses information from both the joint and its hierarchically adjacent joints to explore the hierarchical temporal correlations among joints.
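
The disentanglement described above, and the way errors accumulate along the skeleton tree, can be illustrated with a minimal sketch. This is not the DDHPose implementation: the 5-joint skeleton, the `PARENTS` table, and the helper names `decompose`, `compose`, and `joint_errors` are all hypothetical and chosen only for illustration. The sketch also assumes every bone has non-zero length.

```python
import math

# Hypothetical 5-joint toy skeleton (an assumption, not the paper's skeleton).
# PARENTS[j] is the parent of joint j; -1 marks the root.
# Parents are listed before their children.
PARENTS = [-1, 0, 1, 0, 3]

def decompose(pose):
    """Split joint positions into per-bone lengths and unit directions."""
    lengths, directions = {}, {}
    for j, p in enumerate(PARENTS):
        if p < 0:
            continue  # the root has no incoming bone
        bone = [c - q for c, q in zip(pose[j], pose[p])]
        norm = math.sqrt(sum(v * v for v in bone))  # assumed non-zero
        lengths[j] = norm
        directions[j] = [v / norm for v in bone]
    return lengths, directions

def compose(root, lengths, directions):
    """Rebuild joint positions by walking the tree from the root outward."""
    pose = [None] * len(PARENTS)
    pose[0] = list(root)
    for j, p in enumerate(PARENTS):
        if p < 0:
            continue
        # Each joint is placed relative to its (already rebuilt) parent.
        pose[j] = [q + lengths[j] * d for q, d in zip(pose[p], directions[j])]
    return pose

def joint_errors(pose_a, pose_b):
    """Per-joint Euclidean distance between two poses."""
    return [math.sqrt(sum((a - b) ** 2 for a, b in zip(ja, jb)))
            for ja, jb in zip(pose_a, pose_b)]

# Two chains from the root: 0 -> 1 -> 2 and 0 -> 3 -> 4.
pose = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 2.0, 0.0],
        [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]]

lengths, directions = decompose(pose)
rebuilt = compose(pose[0], lengths, directions)

# Corrupt only the direction of the bone ending at joint 1.
noisy_directions = dict(directions)
noisy_directions[1] = [1.0, 0.0, 0.0]
noisy = compose(pose[0], lengths, noisy_directions)
errors = joint_errors(pose, noisy)
```

In this toy case only one bone direction is corrupted, yet both joint 1 and its child joint 2 end up displaced, while the other chain (joints 3 and 4) is unaffected. This is the hierarchical error propagation that the abstract attributes to the tree structure of the skeleton.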
