Generative models have demonstrated remarkable capability in synthesizing high-quality text, images, and videos. In video generation, contemporary text-to-video models can produce visually stunning results. Evaluating such videos, however, remains a significant challenge. Current research predominantly relies on automated metrics such as FVD, IS, and CLIP Score. These metrics provide an incomplete analysis, particularly in the temporal assessment of video content, which makes them unreliable indicators of true video quality. User studies can reflect human perception accurately, but they are time-intensive and laborious, and their outcomes are often tainted by subjective bias. In this paper, we investigate the limitations of existing metrics and introduce a novel evaluation pipeline, the Text-to-Video Score (T2VScore). This metric integrates two pivotal criteria: (1) Text-Video Alignment, which scrutinizes how faithfully the video represents the given text description, and (2) Video Quality, which evaluates the video's overall production caliber with a mixture of experts. Moreover, to evaluate the proposed metrics and facilitate future improvements to them, we present the TVGE dataset, which collects human judgements of 2,543 text-to-video generated videos on the two criteria. Experiments on the TVGE dataset demonstrate the superiority of the proposed T2VScore in offering a better metric for text-to-video generation.