Recent advancements in text-to-video models such as Sora, Gen-3, MovieGen, and CogVideoX are pushing the boundaries of synthetic video generation, with adoption seen in fields like robotics, autonomous driving, and entertainment. As these models become prevalent, various metrics and benchmarks have emerged to evaluate the quality of the generated videos. However, these metrics emphasize visual quality and smoothness, neglecting temporal fidelity and text-to-video alignment, which are crucial for safety-critical applications. To address this gap, we introduce NeuS-V, a novel synthetic video evaluation metric that rigorously assesses text-to-video alignment using neuro-symbolic formal verification techniques. Our approach first converts the prompt into a formally defined Temporal Logic (TL) specification and translates the generated video into an automaton representation. It then evaluates text-to-video alignment by formally checking the video automaton against the TL specification. Furthermore, we present a dataset of temporally extended prompts and use it to benchmark state-of-the-art video generation models. We find that NeuS-V correlates over 5x more strongly with human evaluations than existing metrics. Our evaluation further reveals that current video generation models perform poorly on these temporally complex prompts, highlighting the need for future work on improving text-to-video generation capabilities.
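To make the verification step concrete, the following is a minimal sketch of how a temporal logic formula might be scored against per-frame evidence from a generated video. The operator semantics (min/max over time) and all names here are illustrative assumptions for exposition, not the paper's implementation; the per-frame proposition scores are stubbed in place of an actual vision-language model.

```python
# Minimal sketch of quantitative temporal-logic checking over a video,
# in the spirit of the NeuS-V pipeline (semantics and names are
# illustrative assumptions, not the paper's implementation).

from typing import List

# A "trace" is a per-frame confidence in [0, 1] that a proposition
# holds, e.g. scores from a vision-language model queried per frame.
Trace = List[float]

def always(trace: Trace) -> float:
    """G p: the proposition must hold on every frame (min over time)."""
    return min(trace)

def eventually(trace: Trace) -> float:
    """F p: the proposition must hold on some frame (max over time)."""
    return max(trace)

def until(p: Trace, q: Trace) -> float:
    """p U q: p holds on every frame before some frame where q holds."""
    best = 0.0
    for k in range(len(q)):
        prefix = min(p[:k], default=1.0)  # p holds strictly before k
        best = max(best, min(prefix, q[k]))
    return best

# Toy example for a temporally extended prompt such as
# "a car drives until it reaches a red light": stubbed per-frame
# scores for the propositions "car is driving" and "red light visible".
driving   = [0.9, 0.9, 0.8, 0.7, 0.2, 0.1]
red_light = [0.0, 0.1, 0.1, 0.3, 0.9, 0.9]

score = until(driving, red_light)
print(f"alignment score for 'driving U red_light': {score:.2f}")
```

A video that satisfies the prompt's temporal structure yields a high score, while a video showing the right objects in the wrong order scores low; this is the kind of ordering violation that frame-averaged visual-quality metrics cannot detect.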