We present RopeTP, a novel framework that combines Robust pose estimation with a diffusion Trajectory Prior to reconstruct global human motion from videos. At the heart of RopeTP is a hierarchical attention mechanism that significantly improves context awareness, which is essential for accurately inferring the posture of occluded body parts. It does so by exploiting the relationships between occluded regions and visible anatomical structures, enhancing the accuracy of local pose estimation. The improved robustness of these local estimates in turn enables the reconstruction of precise and stable global trajectories. Additionally, RopeTP incorporates a diffusion trajectory model that predicts realistic human motion from local pose sequences. This model ensures that the generated trajectories are not only consistent with the observed local actions but also unfold naturally over time, improving the realism and stability of 3D human motion reconstruction. Extensive experimental validation shows that RopeTP surpasses current methods on two benchmark datasets, excelling particularly in scenarios with occlusions. It also outperforms methods that rely on SLAM for initial camera estimates and extensive optimization, delivering more accurate and realistic trajectories.