Recent diffusion-based video generation models can synthesize visually plausible videos, yet they often struggle to satisfy physical constraints. A key reason is that most existing approaches remain single-stage: they entangle high-level physical understanding with low-level visual synthesis, making it difficult to generate content that requires explicit physical reasoning. To address this limitation, we propose a training-free three-stage pipeline, \textit{PhyRPR}: \textit{Phy\uline{R}eason}--\textit{Phy\uline{P}lan}--\textit{Phy\uline{R}efine}, which decouples physical understanding from visual synthesis. Specifically, \textit{PhyReason} uses a large multimodal model for physical state reasoning and an image generator for keyframe synthesis; \textit{PhyPlan} deterministically synthesizes a controllable coarse motion scaffold; and \textit{PhyRefine} injects this scaffold into diffusion sampling via a latent fusion strategy, refining appearance while preserving the planned dynamics. This staged design enables explicit physical control during generation. Extensive experiments under diverse physics constraints show that our method consistently improves physical plausibility and motion controllability.
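As a minimal sketch of the latent fusion step in \textit{PhyRefine} (the exact formulation is not specified here; the mask $m$, cutoff step $\tau$, and scaffold encoding below are illustrative assumptions), one common latent-fusion scheme blends the noised scaffold latent into the sampling trajectory during the early denoising steps:
\[
z_t \;\leftarrow\; m \odot \tilde{z}_t^{\mathrm{plan}} + (1 - m) \odot z_t, \qquad t > \tau,
\]
where $z_t$ is the video diffusion latent at denoising step $t$, $\tilde{z}_t^{\mathrm{plan}}$ is the coarse motion scaffold encoded into latent space and noised to step $t$, $m$ is a spatiotemporal mask over plan-governed regions, and $\tau$ is a cutoff so that later steps refine appearance freely while the planned dynamics are preserved.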