Human motion reconstruction from monocular videos is a fundamental problem in computer vision, with broad applications in AR/VR, robotics, and digital content creation, but it remains challenging under the frequent occlusions of real-world settings. Existing regression-based methods are efficient but fragile to missing observations, while optimization- and diffusion-based approaches improve robustness at the cost of slow inference and heavy preprocessing. To address these limitations, we leverage recent advances in generative masked modeling and present MoRo: Masked Modeling for human motion Recovery under Occlusions. MoRo is an occlusion-robust, end-to-end generative framework that formulates motion reconstruction as a video-conditioned task and efficiently recovers human motion in a consistent global coordinate system from RGB videos. Through masked modeling, MoRo naturally handles occlusions while enabling efficient, end-to-end inference. To overcome the scarcity of paired video-motion data, we design a cross-modality learning scheme that learns multi-modal priors from a set of heterogeneous datasets: (i) a trajectory-aware motion prior trained on MoCap datasets, (ii) an image-conditioned pose prior trained on image-pose datasets, capturing diverse per-frame poses, and (iii) a video-conditioned masked transformer that fuses the motion and pose priors, finetuned on video-motion datasets to integrate visual cues with motion dynamics for robust inference. Extensive experiments on EgoBody and RICH demonstrate that MoRo substantially outperforms state-of-the-art methods in accuracy and motion realism under occlusions, while performing on par in non-occluded scenarios. MoRo achieves real-time inference at 70 FPS on a single H200 GPU.
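To make the masked-modeling idea concrete, the sketch below shows MaskGIT-style iterative decoding, the standard inference scheme for generative masked models: occluded motion tokens start masked, and the model fills them in over a few rounds, committing its most confident predictions first. The `predict` stand-in, the cosine unmasking schedule, and all names here are illustrative assumptions, not MoRo's actual implementation.

```python
import numpy as np

MASK = -1  # sentinel id for masked (e.g., occluded) motion tokens

def predict(tokens, rng):
    """Stand-in for the video-conditioned transformer: propose a token id
    and a confidence score for every position (random here, for illustration)."""
    proposals = rng.integers(0, 512, size=tokens.shape)
    confidence = rng.random(tokens.shape)
    return proposals, confidence

def iterative_decode(tokens, steps=4, seed=0):
    """Fill masked positions over `steps` rounds, keeping the most
    confident predictions first (cosine unmasking schedule)."""
    tokens = tokens.copy()
    rng = np.random.default_rng(seed)
    total_masked = int((tokens == MASK).sum())
    for t in range(steps):
        masked = tokens == MASK
        if not masked.any():
            break
        proposals, conf = predict(tokens, rng)
        # How many tokens stay masked after this step (cosine schedule).
        keep_masked = int(total_masked * np.cos(np.pi / 2 * (t + 1) / steps))
        n_fill = int(masked.sum()) - keep_masked
        # Among currently masked positions, commit the n_fill most confident.
        idx = np.flatnonzero(masked)
        order = idx[np.argsort(-conf[idx])]
        tokens[order[:n_fill]] = proposals[order[:n_fill]]
    return tokens

# Frames 1, 3, 4 are "occluded"; observed tokens are left untouched.
seq = np.array([3, MASK, 7, MASK, MASK, 9])
out = iterative_decode(seq)
assert (out != MASK).all()  # every occluded position has been filled
```

Compared with autoregressive decoding, this parallel fill-in needs only a handful of forward passes regardless of sequence length, which is what makes the end-to-end inference described above fast.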