With the rising interest in Large Multi-modal Models (LMMs) for video understanding, many studies have emphasized general video comprehension capabilities while neglecting the systematic exploration of video quality understanding. To address this oversight, we introduce Q-Bench-Video, a new benchmark specifically designed to evaluate LMMs' proficiency in discerning video quality. a) To ensure video source diversity, Q-Bench-Video encompasses videos from natural scenes, AI-Generated Content (AIGC), and Computer Graphics (CG). b) Building on the traditional multiple-choice question format with the Yes-or-No and What-How categories, we include Open-ended questions to better evaluate complex scenarios. Additionally, we incorporate video pair quality comparison questions to enhance comprehensiveness. c) Beyond the traditional Technical, Aesthetic, and Temporal distortions, we expand our evaluation to include the dimension of AIGC distortions, addressing the increasing demand for video generation. Finally, we collect a total of 2,378 question-answer pairs and test them on 12 open-source and 5 proprietary LMMs. Our findings indicate that while LMMs have a foundational understanding of video quality, their performance remains incomplete and imprecise, with a notable gap relative to human performance. Through Q-Bench-Video, we seek to catalyze community interest, stimulate further research, and unlock the untapped potential of LMMs to close the gap in video quality understanding.