MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks

The popularity of multimodal large language models (MLLMs) has triggered a surge of research on evaluating them. However, existing evaluations of MLLMs focus primarily on comprehension and reasoning over unimodal (vision) content and neglect multimodal (vision-language) content understanding. Beyond multimodal reasoning, multimodal content comprehension tasks require a deep understanding of multimodal contexts, in which the final answer is obtained through interaction between the modalities. In this paper, we introduce MM-BigBench, a comprehensive assessment framework that uses a diverse set of metrics to evaluate the performance of various models and instructions across a wide range of multimodal content comprehension tasks. Our work thus complements existing research on the performance of MLLMs in multimodal comprehension tasks and provides a more holistic evaluation of MLLMs. First, we use the Best Performance metric to determine each model's performance upper bound on each dataset. Next, the Mean Relative Gain metric assesses the overall performance of models and instructions, while the Stability metric measures their sensitivity. Furthermore, previous research evaluates either models or instructions in isolation, neglecting the adaptability between the two. We propose the Adaptability metric to quantify the adaptability between models and instructions. We evaluate a total of 20 language models (14 of them MLLMs) on 14 multimodal datasets spanning 6 tasks, with 10 instructions for each task, and derive novel insights. Our code will be released at https://github.com/declare-lab/MM-BigBench.
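The abstract names the metrics but does not give their formulas, so the following is only a minimal sketch of one plausible reading of three of them, not the authors' implementation. It assumes a per-dataset accuracy table indexed by model and instruction; the table `acc`, the function names, and the specific choices (maximum over instructions, percentage gain over the cross-model average, population standard deviation) are all illustrative assumptions.

```python
from statistics import mean, pstdev

# Hypothetical accuracies (in percent) for one dataset: acc[model][instruction].
# The model and instruction names are placeholders, not results from the paper.
acc = {
    "model_a": {"inst_1": 60.0, "inst_2": 70.0},
    "model_b": {"inst_1": 40.0, "inst_2": 30.0},
}

def best_performance(acc):
    """Assumed reading: a model's upper bound is its best score over all instructions."""
    return {m: max(scores.values()) for m, scores in acc.items()}

def mean_relative_gain(acc):
    """Assumed reading: for each model, average over instructions the percentage
    gain of its accuracy relative to the cross-model mean for that instruction."""
    instructions = list(next(iter(acc.values())))
    avg = {i: mean(acc[m][i] for m in acc) for i in instructions}
    return {
        m: mean((acc[m][i] - avg[i]) / avg[i] * 100 for i in instructions)
        for m in acc
    }

def stability(acc):
    """Assumed reading: population standard deviation of a model's accuracy across
    instructions; a smaller value means less sensitivity to the instruction."""
    return {m: pstdev(list(scores.values())) for m, scores in acc.items()}

best = best_performance(acc)
mrg = mean_relative_gain(acc)
stab = stability(acc)
```

The same functions can be applied to the transposed table (instruction by model) to score instructions instead of models, which mirrors the abstract's statement that the metrics cover both models and instructions.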
