How Far is Video Generation from World Model: A Physical Law Perspective

OpenAI's Sora highlights the potential of video generation for developing world models that adhere to fundamental physical laws. However, it remains an open question whether video generation models can discover such laws purely from visual data, without human priors. A world model that has learned the true law should give predictions that are robust to nuances and should extrapolate correctly to unseen scenarios. In this work, we evaluate generalization in three key scenarios: in-distribution, out-of-distribution, and combinatorial. We developed a 2D simulation testbed for object movement and collisions that generates videos deterministically governed by one or more laws of classical mechanics. This provides an unlimited supply of data for large-scale experimentation and enables quantitative evaluation of whether the generated videos adhere to physical laws. We trained diffusion-based video generation models to predict object movements from initial frames. Our scaling experiments show perfect generalization within the distribution, measurable scaling behavior for combinatorial generalization, but failure in out-of-distribution scenarios. Further experiments reveal two key insights about the generalization mechanisms of these models: (1) the models fail to abstract general physical rules and instead exhibit "case-based" generalization behavior, i.e., mimicking the closest training example; (2) when generalizing to new cases, the models prioritize different factors when referencing training data, in the order color > size > velocity > shape. Our study suggests that scaling alone is insufficient for video generation models to uncover fundamental physical laws, despite its role in Sora's broader success. See our project page at https://phyworld.github.io
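To make the idea of a deterministic testbed with a quantitative check more concrete, the following is a minimal sketch and not the authors' code. It assumes a simplified one-dimensional setting (the paper's testbed is 2D) with two balls undergoing a perfectly elastic collision, and it assumes a mean absolute velocity difference as the error measure. The names `Ball`, `elastic_collision`, `simulate`, and `velocity_error` are hypothetical and were chosen only for this illustration.

```python
from dataclasses import dataclass, replace


@dataclass
class Ball:
    x: float       # center position along the axis
    v: float       # velocity
    mass: float
    radius: float


def elastic_collision(m1, v1, m2, v2):
    """Post-collision velocities for a 1D perfectly elastic collision.

    Derived from conservation of momentum and kinetic energy.
    """
    total = m1 + m2
    u1 = ((m1 - m2) * v1 + 2.0 * m2 * v2) / total
    u2 = ((m2 - m1) * v2 + 2.0 * m1 * v1) / total
    return u1, u2


def simulate(a, b, dt=0.01, steps=400):
    """Roll out two balls with fixed time steps.

    Returns a list of (velocity_a, velocity_b) tuples, one per step.
    The inputs are copied, so the caller's objects are not modified.
    """
    a, b = replace(a), replace(b)
    history = []
    for _ in range(steps):
        a.x += a.v * dt
        b.x += b.v * dt
        touching = abs(b.x - a.x) <= a.radius + b.radius
        approaching = (b.x - a.x) * (b.v - a.v) < 0.0
        if touching and approaching:
            a.v, b.v = elastic_collision(a.mass, a.v, b.mass, b.v)
        history.append((a.v, b.v))
    return history


def velocity_error(predicted, reference):
    """Mean absolute velocity difference between two rollouts (assumed metric)."""
    if len(predicted) != len(reference):
        raise ValueError("rollouts must have the same number of steps")
    total = 0.0
    for (pa, pb), (ra, rb) in zip(predicted, reference):
        total += abs(pa - ra) + abs(pb - rb)
    return total / (2 * len(reference))
```

In this sketch, the output of `simulate` plays the role of the ground truth. Velocities estimated from a model's generated frames could be passed as `predicted` to `velocity_error`, so that a value near zero indicates a rollout consistent with the simulated law. Varying the masses, radii, and initial velocities outside the training ranges would correspond to the out-of-distribution setting described in the abstract.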
