Evaluating the quality of videos generated by text-to-video (T2V) models is important if these models are to produce plausible outputs that convince a viewer of their authenticity. We examine some of the metrics used in this area and highlight their limitations. The paper presents a dataset of more than 1,000 videos generated by 5 recent T2V models, to which several commonly used quality metrics are applied. We also include extensive human quality evaluations of those videos, allowing the relative strengths and weaknesses of the metrics, including human assessment, to be compared. The contribution is an assessment of commonly used quality metrics and a comparison of their performance with that of human evaluations on an open dataset of T2V videos. Our conclusion is that naturalness and semantic matching with the text prompt used to generate the T2V output are both important, but no single measure captures these subtleties when assessing T2V model output.