We introduce CRISP, a method that recovers simulatable human motion and scene geometry from monocular video. Prior work on joint human-scene reconstruction either relies on data-driven priors and joint optimization with no physics in the loop, or recovers noisy geometry whose artifacts cause motion-tracking policies with scene interactions to fail. In contrast, our key insight is to recover convex, clean, and simulation-ready geometry by fitting planar primitives to a point-cloud reconstruction of the scene, via a simple clustering pipeline over depth, normals, and flow. To reconstruct scene geometry that may be occluded during interactions, we exploit human-scene contact modeling (e.g., we use human posture to reconstruct the occluded seat of a chair). Finally, we ensure that the human and scene reconstructions are physically plausible by using them to drive a humanoid controller trained via reinforcement learning. Our approach reduces motion-tracking failure rates from 55.2\% to 6.9\% on human-centric video benchmarks (EMDB, PROX), while delivering 43\% higher RL simulation throughput. We further validate it on in-the-wild videos, including casually captured videos, Internet videos, and even Sora-generated videos. This demonstrates CRISP's ability to generate physically valid human motion and interaction environments at scale, greatly advancing real-to-sim applications for robotics and AR/VR.
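The core geometric step, fitting planar primitives to a clustered point cloud, can be illustrated with a minimal sketch. This is not the paper's implementation; it assumes per-point normals are available (e.g., from a depth map) and uses a simple greedy angular clustering followed by a least-squares plane fit, whereas CRISP additionally clusters over depth and flow:

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane fit: returns (unit normal n, offset d) with n . x + d = 0."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    n = vt[-1]                      # direction of least variance = plane normal
    return n, -n @ centroid

def cluster_by_normal(normals, angle_thresh_deg=15.0):
    """Greedy clustering of per-point normals by angular similarity."""
    cos_thresh = np.cos(np.deg2rad(angle_thresh_deg))
    labels = -np.ones(len(normals), dtype=int)
    centers = []
    for i, n in enumerate(normals):
        for k, c in enumerate(centers):
            if abs(n @ c) >= cos_thresh:   # |dot| handles flipped normals
                labels[i] = k
                break
        else:
            centers.append(n)
            labels[i] = len(centers) - 1
    return labels

# Toy scene: a floor (z = 0) and a wall (x = 1) with small noise.
rng = np.random.default_rng(0)
floor = np.c_[rng.uniform(0, 1, (100, 2)), np.zeros(100)]
wall = np.c_[np.ones(100), rng.uniform(0, 1, (100, 2))]
points = np.vstack([floor, wall]) + rng.normal(0, 1e-3, (200, 3))
normals = np.vstack([np.tile([0.0, 0.0, 1.0], (100, 1)),
                     np.tile([1.0, 0.0, 0.0], (100, 1))])

labels = cluster_by_normal(normals)
for k in np.unique(labels):
    n, d = fit_plane(points[labels == k])
    print(f"primitive {k}: n={np.round(n, 2)}, d={d:.3f}")
```

Each recovered plane (plus an extent estimated from its cluster's points) yields a convex box-like primitive that a physics engine can consume directly, which is what makes the geometry simulation-ready.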