Estimating human and camera trajectories with accurate scale in the world coordinate system from a monocular video is a highly desirable yet challenging and ill-posed problem. In this study, we aim to jointly recover expressive parametric human models (i.e., SMPL-X) and the corresponding camera poses by leveraging the synergy between three critical players: the world, the human, and the camera. Our approach is founded on two key observations. First, camera-frame SMPL-X estimation methods readily recover absolute human depth. Second, human motions inherently provide absolute spatial cues. Integrating these insights, we introduce a novel framework, WHAC, that performs world-grounded expressive human pose and shape estimation (EHPS) alongside camera pose estimation, without relying on traditional optimization techniques. Additionally, we present a new synthetic dataset, WHAC-A-Mole, which includes accurately annotated humans and cameras, and features diverse interactive human motions as well as realistic camera trajectories. Extensive experiments on both standard and newly established benchmarks demonstrate the superiority and efficacy of our framework. We will make the code and dataset publicly available.