Recent generative video models achieve impressive visual quality but remain constrained by limited physical consistency and controllability. Existing video generation methods provide minimal physical control, and single-image-to-3D conversion approaches often suffer from object interpenetration. Furthermore, physics-based scene-level 3D generation methods exhibit spatial misalignment, stylized artifacts, and inconsistencies with the input data, restricting their use in realistic interactive video synthesis. We propose TelePhysics, a training-free framework that converts a single image into a physically consistent and controllable video through holistic scene-level 3D reconstruction. By representing the full scene geometry in a unified spatial coordinate system, TelePhysics resolves object interpenetration and alignment ambiguity. Unlike prior methods, this formulation enables accurate scene-level multi-object interactions and introduces richer, more complex control types for advanced mechanics-based manipulation. By decoupling simulation from rendering, TelePhysics bypasses latency-heavy priors, achieving real-time physical interaction previews while preserving photorealistic visual fidelity. Experimental results demonstrate that TelePhysics substantially outperforms prior methods in physical fidelity, spatial coherence, and controllability. The open-source code is available at https://github.com/xinzhang007/TelePhysics.