Reconstructing physically stable 3D scenes from a single RGB image allows casual images to be converted into simulation-ready digital assets for applications such as immersive interaction and content creation. However, existing single-image reconstruction methods fail to capture the physical structure of a scene. As a result, they often produce geometrically plausible but physically inconsistent results, such as floating or interpenetrating objects, which lead to unstable behavior in physics simulations. Image-conditioned scene generation methods improve physical plausibility but often rely on strong scene priors, yielding object arrangements that look plausible yet do not match the input image. We propose REST3D, a single-image reconstruction framework that reconstructs physically stable 3D scenes by integrating physical scene understanding with physics-constrained refinement. We first introduce an agentic physical scene understanding technique that constructs a scene-tree representation capturing object physical states and inter-object relationships from a gravity-support perspective, providing a structural prior for reconstruction. Leveraging this structure, we initialize the scene using image-to-3D models, then apply scene-tree-guided alignment and physics-constrained optimization to resolve physical violations while preserving visual consistency with the input image. Experiments show that our method significantly reduces physical errors and improves simulation stability on both synthetic and real-world datasets while maintaining strong reconstruction quality. We further demonstrate the reconstructed scenes in VR-based human-object interaction, showing their potential for immersive applications.
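To make the gravity-support idea concrete, the following is a minimal sketch of a scene tree in which each child rests on its parent, together with a pass that removes floating and penetration along the gravity axis. It is an illustration under simplifying assumptions, not the REST3D implementation: objects are reduced to a vertical extent, gravity points along the negative z axis, and the names `Node`, `support_gap`, and `settle` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One object in a hypothetical scene tree; children rest on this node."""
    name: str
    bottom: float  # z of the lowest point (gravity assumed along -z)
    height: float  # vertical extent of the object
    children: List["Node"] = field(default_factory=list)

    @property
    def top(self) -> float:
        return self.bottom + self.height


def support_gap(parent: Node, child: Node) -> float:
    """Positive: the child floats above its supporter. Negative: it penetrates."""
    return child.bottom - parent.top


def settle(node: Node) -> None:
    """Top-down pass that places every child exactly on its supporter's top."""
    for child in node.children:
        child.bottom = node.top  # zero gap: neither floating nor penetrating
        settle(child)            # children of the child follow the moved object


# Toy scene: the table floats 5 cm above the floor and the cup sinks into the table.
cup = Node("cup", bottom=0.70, height=0.10)
table = Node("table", bottom=0.05, height=0.75, children=[cup])
floor = Node("floor", bottom=-0.10, height=0.10, children=[table])

gaps_before = (support_gap(floor, table), support_gap(table, cup))
settle(floor)
gaps_after = (support_gap(floor, table), support_gap(table, cup))
```

Processing the tree from the root downward matters here: a supporter is moved before the objects it supports, so each child is aligned against its parent's final position. A full system would also need to handle horizontal overlap, rotation, and consistency with the input image, which this sketch ignores.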