Recent advancements in text-to-video models such as Sora, Gen-3, MovieGen, and CogVideoX are pushing the boundaries of synthetic video generation, with adoption seen in fields like robotics, autonomous driving, and entertainment. As these models become prevalent, various metrics and benchmarks have emerged to evaluate the quality of the generated videos. However, these metrics emphasize visual quality and smoothness, neglecting temporal fidelity and text-to-video alignment, which are crucial for safety-critical applications. To address this gap, we introduce NeuS-V, a novel synthetic video evaluation metric that rigorously assesses text-to-video alignment using neuro-symbolic formal verification techniques. Our approach first converts the prompt into a formally defined Temporal Logic (TL) specification and translates the generated video into an automaton representation. It then evaluates text-to-video alignment by formally checking the video automaton against the TL specification. Furthermore, we present a dataset of temporally extended prompts and use it to benchmark state-of-the-art video generation models. We find that NeuS-V correlates over 5x more strongly with human evaluations than existing metrics. Our evaluation further reveals that current video generation models perform poorly on these temporally complex prompts, highlighting the need for future work on improving text-to-video generation capabilities.
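To make the verification step concrete, the following is a minimal sketch of how a temporal logic formula might be scored against per-frame evidence from a generated video. The operator semantics (min/max over time) and all names here are illustrative assumptions for exposition, not the paper's implementation; the per-frame proposition scores are stubbed in place of an actual vision-language model.

```python
# Minimal sketch of quantitative temporal-logic checking over a video,
# in the spirit of the NeuS-V pipeline (semantics and names are
# illustrative assumptions, not the paper's implementation).

from typing import List

# A "trace" is a per-frame confidence in [0, 1] that a proposition
# holds, e.g. scores from a vision-language model queried per frame.
Trace = List[float]

def always(trace: Trace) -> float:
    """G p: the proposition must hold on every frame (min over time)."""
    return min(trace)

def eventually(trace: Trace) -> float:
    """F p: the proposition must hold on some frame (max over time)."""
    return max(trace)

def until(p: Trace, q: Trace) -> float:
    """p U q: p holds on every frame before some frame where q holds."""
    best = 0.0
    for k in range(len(q)):
        prefix = min(p[:k], default=1.0)  # p holds strictly before k
        best = max(best, min(prefix, q[k]))
    return best

# Toy example for a temporally extended prompt such as
# "a car drives until it reaches a red light": stubbed per-frame
# scores for the propositions "car is driving" and "red light visible".
driving   = [0.9, 0.9, 0.8, 0.7, 0.2, 0.1]
red_light = [0.0, 0.1, 0.1, 0.3, 0.9, 0.9]

score = until(driving, red_light)
print(f"alignment score for 'driving U red_light': {score:.2f}")
```

A video that satisfies the prompt's temporal structure yields a high score, while a video showing the right objects in the wrong order scores low; this is the kind of ordering violation that frame-averaged visual-quality metrics cannot detect.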