The estimation of 3D human motion from video has progressed rapidly, but current methods still have several key limitations. First, most methods estimate the human in camera coordinates. Second, prior work on estimating humans in global coordinates often assumes a flat ground plane and produces foot sliding. Third, the most accurate methods rely on computationally expensive optimization pipelines, limiting their use to offline applications. Finally, existing video-based methods are surprisingly less accurate than single-frame methods. We address these limitations with WHAM (World-grounded Humans with Accurate Motion), which accurately and efficiently reconstructs 3D human motion in a global coordinate system from video. WHAM learns to lift 2D keypoint sequences to 3D using motion capture data and fuses this with video features, integrating motion context and visual information. WHAM exploits camera angular velocity estimated from a SLAM method together with human motion to estimate the body's global trajectory. We combine this with a contact-aware trajectory refinement method that lets WHAM capture human motion in diverse conditions, such as climbing stairs. WHAM outperforms all existing 3D human motion recovery methods across multiple in-the-wild benchmarks. Code will be available for research purposes at http://wham.is.tue.mpg.de/
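The abstract does not spell out how the global trajectory is assembled, so the following is only a minimal sketch of one generic way to do it, not WHAM's implementation. It assumes that a per-frame root orientation in world coordinates and a per-frame root displacement expressed in the body's own frame are already available, and it accumulates them by simple dead reckoning. The function names `yaw_rotation` and `integrate_trajectory` are hypothetical, and the contact-aware refinement is not modeled.

```python
import numpy as np


def yaw_rotation(theta):
    """Rotation matrix for a turn of `theta` radians about the vertical (y) axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])


def integrate_trajectory(root_orient, local_disp, start=None):
    """Accumulate body-frame displacements into a world-frame path.

    root_orient: (T, 3, 3) world-frame root rotation for each frame (assumed input).
    local_disp:  (T, 3) root displacement per frame, in the root's own frame
                 (assumed input).
    start:       optional (3,) initial world position; defaults to the origin.
    Returns a (T + 1, 3) array of world positions, including the start.
    """
    start = np.zeros(3) if start is None else np.asarray(start, dtype=float)
    # Rotate each body-frame step into world coordinates.
    world_steps = np.einsum("tij,tj->ti", root_orient, local_disp)
    # Sum the steps to obtain positions relative to the start.
    return np.vstack([start, start + np.cumsum(world_steps, axis=0)])


# Two steps straight ahead, then two steps after a 90-degree turn.
orient = np.stack([yaw_rotation(0.0)] * 2 + [yaw_rotation(np.pi / 2)] * 2)
disp = np.tile(np.array([0.0, 0.0, 1.0]), (4, 1))
trajectory = integrate_trajectory(orient, disp)
```

Pure integration of this kind accumulates drift and has no notion of foot contact, which is one reason a separate refinement stage, such as the contact-aware one the abstract mentions, is useful.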