Large vision-language models (LVLMs) have recently achieved rapid progress, sparking numerous studies that evaluate their multi-modal capabilities. However, examining current evaluation work, we identify two primary issues: 1) Visual content is unnecessary for many samples. The answers can be directly inferred from the questions and options, or from the world knowledge embedded in LLMs. This phenomenon is prevalent across current benchmarks. For instance, GeminiPro achieves 42.9% on the MMMU benchmark without any visual input, and outperforms the random-choice baseline across six benchmarks by over 24% on average. 2) Unintentional data leakage exists in LLM and LVLM training. LLMs and LVLMs can still answer some visual-necessary questions without visual content, indicating that these samples were memorized during large-scale training. For example, Sphinx-X-MoE scores 43.6% on MMMU without accessing images, surpassing its LLM backbone by 17.9%. Both problems lead to misjudgments of actual multi-modal gains and can misguide the study of LVLMs. To this end, we present MMStar, an elite vision-indispensable multi-modal benchmark comprising 1,500 samples meticulously selected by humans. MMStar benchmarks 6 core capabilities and 18 detailed axes, aiming to evaluate LVLMs' multi-modal capacities with carefully balanced and purified samples. These samples are first coarsely selected from current benchmarks with an automated pipeline; human review then ensures that each curated sample exhibits visual dependency, minimal data leakage, and a need for advanced multi-modal capabilities. Moreover, two metrics are developed to measure data leakage and actual performance gain in multi-modal training. We evaluate 16 leading LVLMs on MMStar to assess their multi-modal capabilities, and on 7 benchmarks with the proposed metrics to investigate their data leakage and actual multi-modal gain.
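As a rough sketch of what the two metrics could look like (the abstract itself does not define them, so the symbols below are illustrative assumptions): let $S_v$ denote an LVLM's benchmark score with visual input, $S_{wv}$ its score without visual input, and $S_t$ the text-only score of its LLM backbone. The multi-modal gain (MG) and multi-modal leakage (ML) might then take the form
\[
\mathrm{MG} = S_v - S_{wv}, \qquad \mathrm{ML} = \max\!\left(0,\; S_{wv} - S_t\right),
\]
where MG measures how much performance the model actually derives from seeing images, and ML measures how much of its image-free performance exceeds what the backbone alone can explain, with the surplus attributed to leaked evaluation samples.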