Numerous recent video generation models, also known as world models, have demonstrated the ability to generate plausible real-world videos. However, many studies have shown that these models often produce motion that lacks logical or physical coherence. In this paper, we revisit video generation models and find that single-stage approaches struggle to produce high-quality results while maintaining coherent motion reasoning. To address this issue, we propose \textbf{Motion Dreamer}, a two-stage video generation framework. In Stage I, the model generates an intermediate motion representation, such as a segmentation map or depth map, from the input image and motion conditions, focusing solely on the motion itself. In Stage II, the model uses this intermediate motion representation as a condition to synthesize a high-detail video. By decoupling motion reasoning from high-fidelity video synthesis, our approach enables more accurate and physically plausible motion generation. We validate the effectiveness of our approach on the Physion dataset and in autonomous driving scenarios. For example, given a single push, our model can synthesize the sequential toppling of a set of dominoes. Similarly, by varying the movements of the ego vehicle, our model can produce different effects on other vehicles. Our work opens new avenues for creating models that reason about physical interactions in a more coherent and realistic manner. Our webpage is available at: https://envision-research.github.io/MotionDreamer/.