With the rapid advancement of text-conditioned Video Generation Models (VGMs), the quality of generated videos has improved significantly, bringing these models closer to functioning as "*world simulators*" and making real-world-level video generation more accessible and cost-effective. However, the generated videos often contain factual inaccuracies and lack an understanding of fundamental physical laws. While some previous studies have highlighted this issue in limited domains through manual analysis, a comprehensive solution has not yet been established, primarily due to the absence of a generalized, automated approach for modeling and assessing the causal reasoning of these models across diverse scenarios. To address this gap, we propose VACT: an **automated** framework for modeling, evaluating, and measuring the causal understanding of VGMs in real-world scenarios. By combining causal analysis techniques with a carefully designed large language model assistant, our system can assess the causal behavior of models in various contexts without human annotation, offering strong generalization and scalability. Additionally, we introduce multi-level causal evaluation metrics to provide a fine-grained analysis of the causal performance of VGMs. As a demonstration, we use our framework to benchmark several prevailing VGMs, offering insight into their causal reasoning capabilities. Our work lays the foundation for systematically addressing the causal understanding deficiencies in VGMs and contributes to advancing their reliability and real-world applicability.