Recent advances in image and video generation have raised hopes that these models possess world modeling capabilities, that is, the ability to generate realistic, physically plausible videos. Such capabilities could revolutionize applications in robotics, autonomous driving, and scientific simulation. However, before treating these models as world models, we must ask: do they adhere to physical conservation laws? To answer this question, we introduce Morpheus, a benchmark for evaluating video generation models on physical reasoning. It features 80 real-world videos capturing physical phenomena, guided by conservation laws. Because generated videos lack ground truth, we assess their physical plausibility using physics-informed metrics evaluated against the infallible conservation laws known for each physical setting, leveraging advances in physics-informed neural networks and vision-language foundation models. Our findings reveal that, even with advanced prompting and video conditioning, current models struggle to encode physical principles despite generating aesthetically pleasing videos. All data, the leaderboard, and code are open-sourced at our project page.