Multi-modal large language models (MLLMs) have achieved remarkable progress and demonstrated powerful knowledge comprehension and reasoning abilities. However, mastery of domain-specific knowledge, which is essential for evaluating the intelligence of MLLMs, remains a challenge. Current multi-modal benchmarks for domain-specific knowledge concentrate on multiple-choice questions and are predominantly available in English, which limits the comprehensiveness of the evaluation. To this end, we introduce CMMU, a novel benchmark for multi-modal and multi-type question understanding and reasoning in Chinese. CMMU consists of 3,603 questions in 7 subjects, covering knowledge from primary to high school. The questions fall into 3 types: multiple-choice, multiple-response, and fill-in-the-blank, bringing greater challenges to MLLMs. In addition, we propose a rigorous evaluation strategy called ShiftCheck for assessing multiple-choice questions. The strategy aims to reduce position bias, minimize the influence of randomness on correctness, and perform a quantitative analysis of position bias. We evaluate seven open-source MLLMs along with GPT4-V, Gemini-Pro, and Qwen-VL-Plus. The results demonstrate that CMMU poses a significant challenge to recent MLLMs.
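The abstract states the goals of ShiftCheck (reducing position bias, limiting lucky guesses, and quantifying position bias) but not its mechanics. The sketch below is one plausible reading, not the authors' implementation: it assumes the options of a multiple-choice question are cyclically shifted so the correct answer visits every position, and that a question counts as correct only if the model answers correctly under every shift. The names `cyclic_shifts`, `shift_check`, `position_histogram`, and the `predict` callback are hypothetical.

```python
from collections import Counter
from typing import Callable, Iterator, List, Sequence, Tuple

# Hypothetical model interface: takes (question, options), returns the chosen option index.
Predict = Callable[[str, Sequence[str]], int]


def cyclic_shifts(options: Sequence[str], answer_idx: int) -> Iterator[Tuple[List[str], int]]:
    """Yield every cyclic rotation of the options with the relocated answer index."""
    n = len(options)
    for k in range(n):
        shifted = list(options[k:]) + list(options[:k])
        # The option originally at index i now sits at (i - k) mod n.
        yield shifted, (answer_idx - k) % n


def shift_check(predict: Predict, question: str,
                options: Sequence[str], answer_idx: int) -> Tuple[bool, List[int]]:
    """Assumed criterion: correct only if the model is right under all shifts.

    Returns the verdict and the position the model picked in each shift.
    """
    picks: List[int] = []
    all_correct = True
    for shifted, gold in cyclic_shifts(options, answer_idx):
        pred = predict(question, shifted)
        picks.append(pred)
        if pred != gold:
            all_correct = False
    return all_correct, picks


def position_histogram(all_picks: Sequence[Sequence[int]]) -> Counter:
    """Count how often each position was chosen across all shifted prompts.

    Because the correct answer occupies each position exactly once per question,
    a strongly skewed histogram suggests position bias.
    """
    return Counter(p for picks in all_picks for p in picks)


if __name__ == "__main__":
    opts = ["x", "y", "z", "w"]
    always_first = lambda q, o: 0            # a maximally position-biased "model"
    knows_answer = lambda q, o: o.index("z")  # a model that tracks the content
    print(shift_check(always_first, "Q?", opts, 2))
    print(shift_check(knows_answer, "Q?", opts, 2))
```

Under this assumed criterion, a model that always picks the first option is right in exactly one shift and therefore fails the question, whereas a model that follows the option content passes regardless of ordering.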