Numerous recent video generation models, also known as world models, have demonstrated the ability to generate plausible real-world videos. However, many studies have shown that these models often produce motion that lacks logical or physical coherence. In this paper, we revisit video generation models and find that single-stage approaches struggle to produce high-quality results while maintaining coherent motion reasoning. To address this issue, we propose \textbf{Motion Dreamer}, a two-stage video generation framework. In Stage I, the model generates an intermediate motion representation, such as a segmentation map or depth map, from the input image and motion conditions, focusing solely on the motion itself. In Stage II, the model uses this intermediate motion representation as a condition to synthesize a high-detail video. By decoupling motion reasoning from high-fidelity video synthesis, our approach enables more accurate and physically plausible motion generation. We validate the effectiveness of our approach on the Physion dataset and in autonomous driving scenarios. For example, given a single push, our model can synthesize the sequential toppling of a set of dominoes. Similarly, by varying the movements of the ego vehicle, our model can produce different effects on other vehicles. Our work opens new avenues for creating models that reason about physical interactions in a more coherent and realistic manner. Our webpage is available at: https://envision-research.github.io/MotionDreamer/.