Intuitive physics understanding in video diffusion models is essential for building general-purpose, physically plausible world simulators, yet accurately evaluating this capacity remains challenging because physics correctness is difficult to disentangle from visual appearance in generated videos. To this end, we introduce LikePhys, a training-free method that evaluates intuitive physics in video diffusion models by distinguishing physically valid videos from physically impossible ones, using the denoising objective as an ELBO-based likelihood surrogate on a curated dataset of valid-invalid pairs. Testing on our constructed benchmark of twelve scenarios spanning four physics domains, we show that our evaluation metric, Plausibility Preference Error (PPE), aligns strongly with human preference and outperforms state-of-the-art evaluator baselines. We then systematically benchmark intuitive physics understanding in current video diffusion models. Our study further analyses how model design and inference settings affect intuitive physics understanding, and highlights domain-specific variations in capacity across physical laws. Empirical results show that, although current models struggle with complex and chaotic dynamics, physics understanding improves clearly as model capacity and inference settings scale.
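The evaluation idea described above can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: the function names, the noise schedule `alphas_bar`, the `predict_noise` callable standing in for a video diffusion model, the definition of PPE as the fraction of valid-invalid pairs in which the valid video does not receive the lower denoising loss, and the choice to count ties as errors.

```python
import numpy as np


def denoising_loss(predict_noise, video, alphas_bar, rng, n_samples=8):
    """Monte Carlo estimate of the noise-prediction loss for one video.

    A lower loss is read as a higher (ELBO-based) likelihood surrogate.
    `predict_noise(noisy, t)` is a hypothetical stand-in for the model.
    """
    losses = []
    for _ in range(n_samples):
        t = int(rng.integers(len(alphas_bar)))   # random timestep
        a = alphas_bar[t]
        eps = rng.standard_normal(video.shape)   # Gaussian noise
        # Standard forward-diffusion corruption of the clean video.
        noisy = np.sqrt(a) * video + np.sqrt(1.0 - a) * eps
        losses.append(np.mean((predict_noise(noisy, t) - eps) ** 2))
    return float(np.mean(losses))


def plausibility_preference_error(valid_losses, invalid_losses):
    """Fraction of pairs where the valid video is NOT preferred.

    Assumed definition: a pair is an error when the valid video's loss
    is greater than or equal to the invalid video's loss.
    """
    valid = np.asarray(valid_losses, dtype=float)
    invalid = np.asarray(invalid_losses, dtype=float)
    if valid.shape != invalid.shape or valid.size == 0:
        raise ValueError("need equally sized, non-empty loss arrays")
    return float(np.mean(valid >= invalid))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    alphas_bar = np.linspace(0.99, 0.01, 50)     # toy noise schedule
    toy_model = lambda noisy, t: np.zeros_like(noisy)
    video = rng.standard_normal((4, 8, 8, 3))    # frames, H, W, channels
    print(denoising_loss(toy_model, video, alphas_bar, rng))
    print(plausibility_preference_error([0.1, 0.5, 0.3, 0.9],
                                        [0.2, 0.4, 0.6, 0.8]))
```

Under this assumed definition, a model with no preference would score near 0.5 and a lower PPE indicates a stronger preference for physically valid videos. In practice one would likely reuse the same timesteps and noise draws for both videos in a pair to reduce variance, which this sketch omits.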