We present HSImul3R, a unified framework for simulation-ready 3D reconstruction of human-scene interactions (HSI) from casual captures, including sparse-view images and monocular videos. Existing methods suffer from a perception-simulation gap: visually plausible reconstructions often violate physical constraints, leading to instability in physics engines and failure in embodied AI applications. To bridge this gap, we introduce a physically grounded bidirectional optimization pipeline that treats the physics simulator as an active supervisor, jointly refining human dynamics and scene geometry. In the forward direction, we employ Scene-targeted Reinforcement Learning to optimize human motion under the dual supervision of motion fidelity and contact stability. In the reverse direction, we propose Direct Simulation Reward Optimization, which leverages simulation feedback on gravitational stability and interaction success to refine scene geometry. We further present HSIBench, a new benchmark with diverse objects and interaction scenarios. Extensive experiments demonstrate that HSImul3R produces the first stable, simulation-ready HSI reconstructions, which can be directly deployed on real-world humanoid robots.