Are We on the Right Way for Evaluating Large Vision-Language Models?

Large vision-language models (LVLMs) have recently made rapid progress, prompting numerous studies that evaluate their multi-modal capabilities. However, a close look at current evaluation work reveals two primary issues. 1) Visual content is unnecessary for many samples. The answers can be inferred directly from the questions and options, or from the world knowledge embedded in LLMs. This phenomenon is prevalent across current benchmarks. For instance, GeminiPro achieves 42.9% on the MMMU benchmark without any visual input, and outperforms the random-choice baseline across six benchmarks by over 20% on average. 2) Unintentional data leakage exists in LLM and LVLM training. LLMs and LVLMs can still answer some visual-necessary questions without visual content, indicating that these samples were memorized from large-scale training data. For example, Sphinx-X-MoE scores 43.6% on MMMU without accessing images, surpassing its LLM backbone by 17.9%. Both problems lead to misjudgments of actual multi-modal gains and can misguide the study of LVLMs. To this end, we present MMStar, an elite vision-indispensable multi-modal benchmark comprising 1,500 samples meticulously selected by humans. MMStar benchmarks 6 core capabilities and 18 detailed axes, aiming to evaluate LVLMs' multi-modal capacities with carefully balanced and purified samples. These samples are first coarsely selected from current benchmarks with an automated pipeline; human review then ensures that each curated sample exhibits visual dependency, has minimal data leakage, and requires advanced multi-modal capabilities. Moreover, two metrics are developed to measure data leakage and actual performance gain in multi-modal training. We evaluate 16 leading LVLMs on MMStar to assess their multi-modal capabilities, and on 7 benchmarks with the proposed metrics to investigate their data leakage and actual multi-modal gain.
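
The abstract names two metrics, one for data leakage and one for actual multi-modal gain, without defining them. As a minimal sketch, assuming a simple difference-based formulation, they could be computed from three accuracy scores: the LVLM with images, the LVLM without images, and its LLM backbone without images. The function names, the argument names, and the clamping of leakage at zero are assumptions made for illustration, not definitions taken from the text above.

```python
def multimodal_gain(score_with_images: float, score_without_images: float) -> float:
    """Assumed formulation: accuracy gained by actually seeing the image.

    Both arguments are accuracies in percent for the same LVLM on the
    same benchmark, with and without the visual input.
    """
    return score_with_images - score_without_images


def multimodal_leakage(score_without_images: float, backbone_score: float) -> float:
    """Assumed formulation: how far the image-free LVLM exceeds its LLM backbone.

    A positive value suggests the multi-modal training data contained the
    benchmark samples. Negative differences are clamped to zero (an
    assumption of this sketch), since they do not indicate leakage.
    """
    return max(0.0, score_without_images - backbone_score)


if __name__ == "__main__":
    # 43.6 is the image-free MMMU score quoted in the abstract; the other
    # two numbers are hypothetical placeholders, not reported results.
    lvlm_text_only = 43.6
    lvlm_with_images = 50.0   # hypothetical
    llm_backbone = 30.0       # hypothetical

    print("gain:", round(multimodal_gain(lvlm_with_images, lvlm_text_only), 1))
    print("leakage:", round(multimodal_leakage(lvlm_text_only, llm_backbone), 1))
```

Under this reading, a model with a high image-free score and a small gain is benefiting mostly from the question text or from memorized samples rather than from the image.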