In the quest for artificial general intelligence, Multi-modal Large Language Models (MLLMs) have emerged as a focal point of recent advancements. However, the predominant focus remains on developing their capabilities in static image understanding. The potential of MLLMs for processing sequential visual data remains insufficiently explored, highlighting the absence of a comprehensive, high-quality assessment of their performance. In this paper, we introduce Video-MME, the first-ever full-spectrum, Multi-Modal Evaluation benchmark of MLLMs in Video analysis. Our work is distinguished from existing benchmarks by four key features: 1) Diversity in video types, spanning 6 primary visual domains with 30 subfields to ensure broad scenario generalizability; 2) Duration in the temporal dimension, encompassing short-, medium-, and long-term videos ranging from 11 seconds to 1 hour, for robust contextual dynamics; 3) Breadth in data modalities, integrating multi-modal inputs beyond video frames, including subtitles and audio, to unveil the all-round capabilities of MLLMs; 4) Quality in annotations, employing rigorous manual labeling by expert annotators to enable precise and reliable model assessment. In total, 900 videos spanning 256 hours are manually selected and annotated by repeatedly viewing all the video content, resulting in 2,700 question-answer pairs. With Video-MME, we extensively evaluate various state-of-the-art MLLMs, including the GPT-4 series and Gemini 1.5 Pro, as well as open-source image models such as InternVL-Chat-V1.5 and video models such as LLaVA-NeXT-Video. Our experiments reveal that Gemini 1.5 Pro is the best-performing commercial model, significantly outperforming the open-source models. Our dataset, together with these findings, underscores the need for further improvements in handling longer sequences and multi-modal data. Project Page: https://video-mme.github.io
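To make the benchmark's structure concrete, below is a minimal sketch of how an evaluation loop over Video-MME's 2,700 multiple-choice question-answer pairs might look. The JSON field names (video_id, duration, question, options, answer), the load_frames stub, and the model.answer interface are hypothetical illustrations for exposition only, not the benchmark's official release schema or API.

```python
import json
from collections import defaultdict

def load_frames(video_id: str, num_frames: int = 16):
    """Hypothetical stub: a real pipeline would sample num_frames
    frames from the video file (and optionally attach subtitles
    and audio) as input for the MLLM."""
    return [f"{video_id}_frame_{i}" for i in range(num_frames)]

def evaluate(model, annotation_path: str = "video_mme.json"):
    """Score a model on Video-MME-style QA records.

    Each record is assumed to hold a video reference, a duration
    category (short / medium / long), a multiple-choice question
    with lettered options, and the ground-truth letter.
    """
    with open(annotation_path) as f:
        records = json.load(f)

    correct, total = defaultdict(int), defaultdict(int)
    for rec in records:
        frames = load_frames(rec["video_id"])
        prompt = rec["question"] + "\n" + "\n".join(rec["options"])
        pred = model.answer(frames, prompt)  # assumed model interface
        total[rec["duration"]] += 1
        if pred.strip().upper().startswith(rec["answer"]):
            correct[rec["duration"]] += 1

    # Report accuracy per duration category, mirroring the paper's
    # short / medium / long split, plus the overall score.
    for cat in total:
        print(f"{cat}: {correct[cat] / total[cat]:.1%}")
    print(f"overall: {sum(correct.values()) / sum(total.values()):.1%}")
```

Scoring by duration category, rather than only overall, reflects the benchmark's emphasis on contextual dynamics across 11-second to 1-hour videos.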