In the quest for artificial general intelligence, Multi-modal Large Language Models (MLLMs) have emerged as a focal point of recent advancements. However, the predominant focus remains on developing their capabilities in static image understanding. The potential of MLLMs for processing sequential visual data remains insufficiently explored, highlighting the absence of a comprehensive, high-quality assessment of their performance. In this paper, we introduce Video-MME, the first-ever full-spectrum, Multi-Modal Evaluation benchmark of MLLMs in Video analysis. Our work is distinguished from existing benchmarks by four key features: 1) Diversity in video types, spanning 6 primary visual domains with 30 subfields to ensure broad scenario generalizability; 2) Duration in the temporal dimension, encompassing short-, medium-, and long-term videos ranging from 11 seconds to 1 hour, for robust contextual dynamics; 3) Breadth in data modalities, integrating multi-modal inputs beyond video frames, including subtitles and audio, to unveil the all-round capabilities of MLLMs; 4) Quality in annotations, employing rigorous manual labeling by expert annotators to enable precise and reliable model assessment. In total, 900 videos spanning 256 hours are manually selected and annotated by repeatedly viewing all the video content, resulting in 2,700 question-answer pairs. With Video-MME, we extensively evaluate various state-of-the-art MLLMs, including the GPT-4 series and Gemini 1.5 Pro, as well as open-source image models such as InternVL-Chat-V1.5 and video models such as LLaVA-NeXT-Video. Our experiments reveal that Gemini 1.5 Pro is the best-performing commercial model, significantly outperforming the open-source models. Our dataset, together with these findings, underscores the need for further improvements in handling longer sequences and multi-modal data. Project Page: https://video-mme.github.io
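To make the benchmark's structure concrete, below is a minimal sketch of how an evaluation loop over Video-MME's 2,700 multiple-choice question-answer pairs might look. The JSON field names (video_id, duration, question, options, answer), the load_frames stub, and the model.answer interface are hypothetical illustrations for exposition only, not the benchmark's official release schema or API.

```python
import json
from collections import defaultdict

def load_frames(video_id: str, num_frames: int = 16):
    """Hypothetical stub: a real pipeline would sample num_frames
    frames from the video file (and optionally attach subtitles
    and audio) as input for the MLLM."""
    return [f"{video_id}_frame_{i}" for i in range(num_frames)]

def evaluate(model, annotation_path: str = "video_mme.json"):
    """Score a model on Video-MME-style QA records.

    Each record is assumed to hold a video reference, a duration
    category (short / medium / long), a multiple-choice question
    with lettered options, and the ground-truth letter.
    """
    with open(annotation_path) as f:
        records = json.load(f)

    correct, total = defaultdict(int), defaultdict(int)
    for rec in records:
        frames = load_frames(rec["video_id"])
        prompt = rec["question"] + "\n" + "\n".join(rec["options"])
        pred = model.answer(frames, prompt)  # assumed model interface
        total[rec["duration"]] += 1
        if pred.strip().upper().startswith(rec["answer"]):
            correct[rec["duration"]] += 1

    # Report accuracy per duration category, mirroring the paper's
    # short / medium / long split, plus the overall score.
    for cat in total:
        print(f"{cat}: {correct[cat] / total[cat]:.1%}")
    print(f"overall: {sum(correct.values()) / sum(total.values()):.1%}")
```

Scoring by duration category, rather than only overall, reflects the benchmark's emphasis on contextual dynamics across 11-second to 1-hour videos.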