MM-Vet, with open-ended vision-language questions designed to evaluate integrated capabilities, has become one of the most popular benchmarks for evaluating large multimodal models. MM-Vet assesses six core vision-language (VL) capabilities: recognition, knowledge, spatial awareness, language generation, OCR, and math. However, its question format is restricted to single image-text pairs and lacks the interleaved image and text sequences that are prevalent in real-world scenarios. To address this limitation, we introduce MM-Vet v2, which adds a new VL capability, "image-text sequence understanding", to evaluate models' ability to process interleaved VL sequences. In addition, we maintain the high quality of the evaluation samples while further expanding the size of the evaluation set. Benchmarking large multimodal models with MM-Vet v2, we find that Claude 3.5 Sonnet is the best model with a score of 71.8, slightly outperforming GPT-4o, which scored 71.0. Among open-weight models, InternVL2-Llama3-76B leads with a score of 68.4. The code, data, and leaderboard are available at https://github.com/yuweihao/MM-Vet.