The dominant paradigm in 3D human pose estimation, which lifts a 2D pose sequence to 3D, relies heavily on long-term temporal clues (i.e., a daunting number of video frames) for improved accuracy. This reliance leads to performance saturation, intractable computation, and the non-causal problem. We attribute these issues to the inherent inability of lifting-based methods to perceive spatial context, as plain 2D joint coordinates carry no visual cues. To address this issue, we propose a straightforward yet powerful solution: leveraging the readily available intermediate visual representations produced by off-the-shelf (pre-trained) 2D pose detectors, without even finetuning them on the 3D task. The key observation is that, while the pose detector learns to localize 2D joints, such representations (e.g., feature maps) implicitly encode joint-centric spatial context thanks to the regional operations in backbone networks. We design a simple baseline named Context-Aware PoseFormer to showcase the effectiveness of this approach. Without access to any temporal information, the proposed method significantly outperforms its context-agnostic counterpart, PoseFormer, and other state-of-the-art methods that use up to hundreds of video frames, in both speed and precision. Project page: https://qitaozhao.github.io/ContextAware-PoseFormer
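To make the notion of joint-centric context concrete, the following is a minimal sketch, not the authors' implementation, of one way to read a per-joint feature vector out of a detector's feature map: bilinear sampling at each detected 2D joint location. The function name `sample_joint_features`, the array shapes, and the choice of bilinear interpolation are all illustrative assumptions; the feature map stands in for an intermediate output of a hypothetical pre-trained 2D pose detector.

```python
import numpy as np


def sample_joint_features(feature_map, joints_xy, image_size):
    """Bilinearly sample a feature map at 2D joint locations.

    feature_map: (C, H, W) array, assumed to come from a detector backbone.
    joints_xy:   (J, 2) array of joint (x, y) positions in image pixels.
    image_size:  (width, height) of the input image in pixels.
    Returns a (J, C) array: one context vector per joint.
    """
    _, h, w = feature_map.shape
    img_w, img_h = image_size

    # Rescale pixel coordinates to feature-map coordinates (corners aligned).
    x = np.clip(joints_xy[:, 0] * (w - 1) / (img_w - 1), 0, w - 1)
    y = np.clip(joints_xy[:, 1] * (h - 1) / (img_h - 1), 0, h - 1)

    # Integer corners surrounding each sampling point.
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)

    # Fractional offsets act as interpolation weights.
    wx = x - x0
    wy = y - y0

    # Each indexed term has shape (C, J); weights broadcast over channels.
    top = feature_map[:, y0, x0] * (1 - wx) + feature_map[:, y0, x1] * wx
    bottom = feature_map[:, y1, x0] * (1 - wx) + feature_map[:, y1, x1] * wx
    return (top * (1 - wy) + bottom * wy).T


# Toy feature map whose two channels store the x and y grid coordinates,
# so the sampled vector at a joint should equal the joint's own position.
grid_x, grid_y = np.meshgrid(np.arange(4.0), np.arange(4.0))
toy_features = np.stack([grid_x, grid_y])        # shape (2, 4, 4)
toy_joints = np.array([[1.5, 2.0], [3.0, 0.0]])  # shape (2, 2)
joint_context = sample_joint_features(toy_features, toy_joints, (4, 4))
```

In this toy setup each joint receives a vector that depends on the features around its location rather than on its coordinates alone, which is the kind of spatial cue a purely coordinate-based lifting model never sees. A single-frame model could then consume these per-joint vectors alongside the 2D coordinates.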