As the capabilities of large multimodal models (LMMs) continue to advance, the need to evaluate their performance grows with them. The gap is even larger for evaluating the advanced knowledge and reasoning abilities of LMMs in non-English contexts such as Chinese. We introduce CMMMU, a new Chinese Massive Multi-discipline Multimodal Understanding benchmark designed to evaluate LMMs on tasks that demand college-level subject knowledge and deliberate reasoning in a Chinese context. CMMMU is inspired by MMMU and strictly follows its annotation and analysis pattern. CMMMU includes 12k manually collected multimodal questions from college exams, quizzes, and textbooks. Like its companion MMMU, it covers six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions span 30 subjects and comprise 39 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures. CMMMU focuses on complex perception and reasoning with domain-specific knowledge in the Chinese context. We evaluate 11 open-source LMMs and one proprietary model, GPT-4V(ision). Even GPT-4V achieves an accuracy of only 42%, indicating substantial room for improvement. We expect CMMMU to help the community build next-generation LMMs towards expert artificial intelligence and to promote the democratization of LMMs by providing diverse language contexts.