Driven by growing model capacity and training scale, Text-to-Video (T2V) generation models have recently made substantial progress in video quality, length, and instruction-following capability. However, whether these models understand physics and can generate physically plausible videos remains an open question. Although Vision-Language Models (VLMs) are widely used as general-purpose evaluators in various applications, they struggle to identify physically impossible content in generated videos. To investigate this issue, we construct the \textbf{PID} (\textbf{P}hysical \textbf{I}mplausibility \textbf{D}etection) dataset, which consists of a \textit{test split} of 500 manually annotated videos and a \textit{train split} of 2,588 paired videos, where each implausible video is generated by carefully rewriting the caption of its corresponding real-world video to induce T2V models to produce physically implausible content. Using this dataset, we introduce a lightweight fine-tuning approach that enables VLMs not only to detect physically implausible events but also to generate textual explanations of the violated physical principles. Using the fine-tuned VLM as a physical plausibility detector and explainer, which we call \textbf{PhyDetEx}, we benchmark a series of state-of-the-art T2V models to assess their adherence to physical laws. Our findings show that although recent T2V models have made notable progress toward generating physically plausible content, understanding and adhering to physical laws remains challenging, especially for open-source models. Our dataset, training code, and checkpoints are available at \href{https://github.com/Zeqing-Wang/PhyDetEx}{https://github.com/Zeqing-Wang/PhyDetEx}.