Reconstructing human motion and its surrounding environment is crucial for understanding human-scene interaction and predicting human movements in the scene. While much progress has been made in capturing human-scene interaction in constrained environments, prior methods struggle to reconstruct natural and diverse human motion and scene context from web videos. In this work, we propose JOSH, a novel optimization-based method for 4D human-scene reconstruction in the wild from monocular videos. JOSH uses dense scene reconstruction and human mesh recovery as initialization, then leverages human-scene contact constraints to jointly optimize the scene, the camera poses, and the human motion. Experimental results show that, by jointly optimizing scene geometry and human motion, JOSH achieves better results on both global human motion estimation and dense scene reconstruction. We further design a more efficient model, JOSH3R, and train it directly with pseudo-labels from web videos. JOSH3R outperforms other optimization-free methods even though it is trained only with labels predicted by JOSH, further demonstrating its accuracy and generalization ability.
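To make the idea of a human-scene contact constraint concrete, the following is a minimal toy sketch, not the JOSH objective itself. It assumes a contact term can be written as the mean squared distance from contact points on the body (for example, foot joints) to their nearest scene points, and it optimizes only a single 3D translation of the human by gradient descent. The names `contact_loss` and `fit_translation`, the step count, and the learning rate are all hypothetical choices for illustration; the abstract does not specify the loss or the optimizer, and the actual method also optimizes the scene and camera poses.

```python
import numpy as np


def contact_loss(contact_pts: np.ndarray, scene_pts: np.ndarray) -> float:
    """Mean squared distance from each contact point to its nearest scene point.

    Assumed toy formulation, not the loss used in the paper.
    """
    # Pairwise differences, shape (num_contacts, num_scene_points, 3).
    diff = contact_pts[:, None, :] - scene_pts[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    return float(np.mean(np.min(sq_dist, axis=1)))


def fit_translation(contact_pts: np.ndarray, scene_pts: np.ndarray,
                    steps: int = 200, lr: float = 0.1) -> np.ndarray:
    """Gradient descent on a 3D offset added to all contact points."""
    offset = np.zeros(3)
    for _ in range(steps):
        moved = contact_pts + offset
        diff = moved[:, None, :] - scene_pts[None, :, :]
        # Index of the nearest scene point for each contact point.
        nearest = np.argmin(np.sum(diff ** 2, axis=-1), axis=1)
        residual = moved - scene_pts[nearest]
        # Gradient of the mean squared distance with respect to the offset,
        # holding the nearest-neighbour assignment fixed for this step.
        grad = 2.0 * residual.mean(axis=0)
        offset -= lr * grad
    return offset


# Toy scene: a flat ground plane sampled on a 5 x 5 grid at z = 0.
xs, ys = np.meshgrid(np.arange(0.0, 2.5, 0.5), np.arange(0.0, 2.5, 0.5))
scene = np.stack([xs.ravel(), ys.ravel(), np.zeros(xs.size)], axis=1)

# Two "foot" contact points floating 0.3 units above the ground.
feet = np.array([[0.5, 0.5, 0.3], [1.0, 0.5, 0.3]])

loss_before = contact_loss(feet, scene)
offset = fit_translation(feet, scene)
loss_after = contact_loss(feet + offset, scene)
```

In this toy setup the only way to reduce the loss is to move the feet down onto the plane, so the recovered offset is purely vertical. A joint optimization of the kind the abstract describes would expose scene and camera parameters to the same kind of term alongside the human motion.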