Recent years have witnessed great advances in video generation. However, the development of automatic video metrics has lagged significantly behind: none of the existing metrics provides reliable scores for generated videos. The main barrier is the lack of large-scale human-annotated datasets. In this paper, we release VideoFeedback, the first large-scale dataset containing human-provided multi-aspect scores for 37.6K synthesized videos from 11 existing video generative models. We train VideoScore (initialized from Mantis) on VideoFeedback to enable automatic video quality assessment. Experiments show that the Spearman correlation between VideoScore and human ratings reaches 77.1 on VideoFeedback-test, beating the prior best metric by about 50 points. Further results on the held-out EvalCrafter, GenAI-Bench, and VBench benchmarks show that VideoScore consistently correlates far more strongly with human judges than other metrics do. Given these results, we believe VideoScore can serve as a good proxy for human raters to (1) rate different video models to track progress and (2) simulate fine-grained human feedback in Reinforcement Learning from Human Feedback (RLHF) to improve current video generation models.
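To make the evaluation protocol concrete, the Spearman correlation reported above is simply the Pearson correlation computed on the ranks of the two score lists. The sketch below uses hypothetical per-video scores (the variable names and numbers are illustrative, not from the paper) and a minimal pure-Python Spearman implementation with average ranks for ties:

```python
# Sketch: comparing an automatic metric's scores against human ratings
# with Spearman rank correlation. All scores here are hypothetical.

def _ranks(xs):
    """Return 1-based ranks of xs, averaging ranks within tied groups."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # extend j over the group of values tied with xs[order[i]]
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    ra, rb = _ranks(a), _ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)

# Hypothetical per-video scores: automatic metric vs. mean human rating.
metric_scores = [3.1, 2.4, 3.8, 1.9, 2.9]
human_scores = [3.0, 2.5, 4.0, 2.0, 3.5]
print(round(spearman(metric_scores, human_scores), 3))  # → 0.9
```

In practice one would use `scipy.stats.spearmanr` on the full test set; the hand-rolled version above only illustrates what the reported correlation measures.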