MM-Vet, with open-ended vision-language questions designed to evaluate integrated capabilities, has become one of the most popular benchmarks for evaluating large multimodal models. MM-Vet assesses six core vision-language (VL) capabilities: recognition, knowledge, spatial awareness, language generation, OCR, and math. However, its question format is restricted to single image-text pairs and lacks the interleaved image and text sequences that are prevalent in real-world scenarios. To address this limitation, we introduce MM-Vet v2, which adds a new VL capability, "image-text sequence understanding", to evaluate models' ability to process interleaved VL sequences. In addition, we maintain the high quality of the evaluation samples while further expanding the size of the evaluation set. Benchmarking large multimodal models with MM-Vet v2, we find that Claude 3.5 Sonnet is the best model with a score of 71.8, slightly outperforming GPT-4o, which scored 71.0. Among open-weight models, InternVL2-Llama3-76B leads with a score of 68.4. The code, data, and leaderboard are available at https://github.com/yuweihao/MM-Vet.