Large multimodal models (LMMs) have demonstrated outstanding capabilities in various visual perception tasks, which in turn makes their evaluation important. However, video aesthetic quality assessment, a fundamental ability for humans, remains underexplored for LMMs. To address this, we introduce VideoAesBench, a comprehensive benchmark for evaluating LMMs' understanding of video aesthetic quality. VideoAesBench has several key characteristics: (1) Diverse content: 1,804 videos drawn from multiple sources, including user-generated (UGC), AI-generated (AIGC), compressed, robotic-generated (RGC), and game videos. (2) Multiple question formats: traditional single-choice, multi-choice, and true-or-false questions, plus novel open-ended questions for video aesthetics description. (3) Holistic video aesthetics dimensions: visual-form questions covering 5 aspects, visual-style questions covering 4 aspects, and visual-affectiveness questions covering 3 aspects. Based on VideoAesBench, we benchmark 23 open-source and commercial LMMs. Our findings show that current LMMs possess only basic video aesthetics perception ability; their performance remains incomplete and imprecise. We hope VideoAesBench can serve as a strong testbed and offer insights for explainable video aesthetics assessment.