Video generative models have made remarkable progress, yet they often produce visual artifacts that violate physical dynamics. Recent works such as PhysGen3D tackle single-image-to-3D physics through mesh reconstruction and physically based rendering, but challenges remain in modeling fluid dynamics, multi-object interactions, and photorealism. This work introduces 3DPhysVideo, a novel training-free pipeline that generates physically realistic videos from a single image. We repurpose an off-the-shelf image-to-video (I2V) flow model for two stages. First, we use it as a novel view synthesizer, guiding it with rendered point clouds to reconstruct complete 360-degree 3D scene geometry. Second, after applying physics solvers to this geometry, we use the physically simulated point cloud to guide the same I2V flow model to synthesize the final, high-quality video. Both stages rely on Consistency-Guided Flow SDE, which decomposes the predicted velocity of the I2V flow model into a denoising term and a consistency bias and thereby enforces consistency with the conditional inputs, allowing us to repurpose the model for both 3D reconstruction and simulation-guided video generation. In diverse experiments, including multi-object and fluid interaction scenes, our method bridges the gap from a single image to physically plausible videos while remaining efficient enough to run on a single consumer GPU. It outperforms state-of-the-art baselines on GPT-based scores, the VideoPhy benchmark, and human evaluation.
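The abstract does not state the exact form of the velocity decomposition, so the following is only a toy illustration of the general idea, not the authors' formulation. It assumes a rectified-flow parameterization, x_t = (1 - t) * x0 + t * eps with velocity v = eps - x0, and a deterministic update rather than an SDE. The function names, the `mask` and `weight` parameters, and the specific form of the bias are hypothetical choices made for this sketch.

```python
from typing import List, Tuple

Vec = List[float]


def split_velocity(x_t: Vec, v: Vec, t: float) -> Tuple[Vec, Vec]:
    """Recover clean-sample and noise estimates from a flow velocity.

    Assumes x_t = (1 - t) * x0 + t * eps and v = eps - x0
    (a rectified-flow convention chosen for this sketch).
    """
    x0_hat = [x - t * vi for x, vi in zip(x_t, v)]
    eps_hat = [x + (1.0 - t) * vi for x, vi in zip(x_t, v)]
    return x0_hat, eps_hat


def guided_velocity(
    x_t: Vec, v: Vec, t: float, cond: Vec, mask: Vec, weight: float = 1.0
) -> Tuple[Vec, Vec]:
    """Return (guided velocity, bias) for a hypothetical consistency guidance.

    The model velocity v plays the role of the denoising term. The bias is
    the extra velocity that moves the implied clean estimate toward `cond`
    wherever `mask` is 1. Requires t > 0.
    """
    if t <= 0.0:
        raise ValueError("t must be positive")
    x0_hat, _ = split_velocity(x_t, v, t)
    # Shifting the clean estimate by d changes the velocity by -d / t.
    bias = [-weight * m * (c - x0) / t for m, c, x0 in zip(mask, cond, x0_hat)]
    v_guided = [vi + b for vi, b in zip(v, bias)]
    return v_guided, bias
```

In this sketch, with `weight=1.0` the clean estimate implied by the guided velocity, `x_t - t * v_guided`, equals `cond` on masked entries and the unguided estimate elsewhere. In the pipeline described above, the conditioning signal would come from rendered or physically simulated point clouds, but how those are encoded and weighted is not specified in this text.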