Multi-modal large language models (MLLMs) have made remarkable progress and demonstrate strong knowledge comprehension and reasoning abilities. However, mastering domain-specific knowledge, which is essential for evaluating the intelligence of MLLMs, remains a challenge. Current multi-modal benchmarks for domain-specific knowledge concentrate on multiple-choice questions and are predominantly in English, which limits the comprehensiveness of evaluation. To this end, we introduce CMMU, a novel benchmark for multi-modal and multi-type question understanding and reasoning in Chinese. CMMU consists of 3,603 questions across 7 subjects, covering knowledge from primary to high school. The questions fall into 3 types: multiple-choice, multiple-response, and fill-in-the-blank, which makes the benchmark more challenging for MLLMs. In addition, we propose a rigorous evaluation strategy called ShiftCheck for assessing multiple-choice questions. The strategy aims to reduce position bias, minimize the influence of randomness on correctness, and enable a quantitative analysis of position bias. We evaluate seven open-source MLLMs along with GPT4-V, Gemini-Pro, and Qwen-VL-Plus. The results demonstrate that CMMU poses a significant challenge to recent MLLMs.
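The abstract does not spell out how ShiftCheck operates. As an illustration only, the sketch below assumes a circular-shift scheme: each multiple-choice question is posed once per cyclic rotation of its options, and it counts as correct only if the model picks the right option in every rotation, which makes a lucky guess or a fixed favorite position much less likely to score. The names `shifted_variants`, `shift_check`, `picked_positions`, and the `predict(question, options) -> index` interface are hypothetical, and the position tally is a simple stand-in for a bias analysis rather than the metric defined in the paper.

```python
from collections import Counter
from typing import Callable, Iterator, List, Sequence, Tuple

# Hypothetical model interface: given a question and its options,
# return the 0-based index of the option the model selects.
Predict = Callable[[str, List[str]], int]


def shifted_variants(
    options: Sequence[str], answer_idx: int
) -> Iterator[Tuple[List[str], int]]:
    """Yield every cyclic rotation of the options with the moved answer index."""
    n = len(options)
    for k in range(n):
        # Rotate right by k: the option originally at index j lands at (j + k) % n.
        shifted = [options[(i - k) % n] for i in range(n)]
        yield shifted, (answer_idx + k) % n


def shift_check(
    predict: Predict, question: str, options: Sequence[str], answer_idx: int
) -> bool:
    """Assumed scoring rule: correct only if right under every rotation."""
    return all(
        predict(question, opts) == idx
        for opts, idx in shifted_variants(options, answer_idx)
    )


def picked_positions(
    predict: Predict, question: str, options: Sequence[str]
) -> Counter:
    """Tally which position the model picks across all rotations."""
    return Counter(
        predict(question, opts) for opts, _ in shifted_variants(options, 0)
    )


def always_first(question: str, options: List[str]) -> int:
    """Toy model with extreme position bias: always answers the first option."""
    return 0


def oracle(question: str, options: List[str]) -> int:
    """Toy model that finds the correct option wherever it is placed."""
    return options.index("4")


if __name__ == "__main__":
    opts = ["3", "4", "5", "6"]
    print(shift_check(always_first, "2 + 2 = ?", opts, 1))
    print(shift_check(oracle, "2 + 2 = ?", opts, 1))
    print(picked_positions(always_first, "2 + 2 = ?", opts))
```

Under this assumed rule, a model that always answers the first option is right in exactly one rotation and therefore fails the question, while its position tally concentrates on a single index, which is the kind of signal a position-bias analysis would look for.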