Recent advances in diffusion-based video generation have substantially improved visual quality and short-term temporal coherence. However, existing methods still struggle to produce videos with physically consistent and causally plausible dynamics, especially in scenarios involving long-horizon interactions. We attribute this limitation to the fact that video diffusion models learn physical consistency largely implicitly, whereas vision-language models can reason about physical laws explicitly. Motivated by this observation, we propose \textbf{CausalMotion}, a training-free framework that injects explicit physical reasoning into video generation through structured intermediate representations. Our key idea is to decouple reasoning from generation: a vision-language model decomposes a text prompt into a sequence of causally consistent keyframes and object-centric motion trajectories. These representations are then aligned and integrated as soft constraints that guide a pretrained video diffusion model during inference. This design enables explicit modeling of object dynamics and causal transitions without additional training or supervision. Extensive experiments show that our method consistently improves physical plausibility and temporal coherence, particularly in dynamics-intensive scenarios, while maintaining high perceptual video quality.