Multilingual multimodal reasoning is a core component in achieving human-level intelligence. However, most existing benchmarks for multilingual multimodal reasoning struggle to differentiate between models of varying performance; even language models without visual capabilities can easily achieve high scores. This leaves a comprehensive evaluation of leading multilingual multimodal models largely unexplored. In this work, we introduce M4U, a novel and challenging benchmark for assessing multi-discipline multilingual multimodal understanding and reasoning. M4U contains 8,931 samples covering 64 disciplines across 16 subfields in Science, Engineering, and Healthcare in Chinese, English, and German. Using M4U, we conduct extensive evaluations of 21 leading Large Multimodal Models (LMMs) and Large Language Models (LLMs) with external tools. The evaluation results show that the state-of-the-art model, GPT-4o, achieves only 47.6% average accuracy on M4U. Additionally, we observe that the leading LMMs exhibit significant language preferences. Our in-depth analysis indicates that leading LMMs, including GPT-4o, suffer performance degradation when prompted with cross-lingual multimodal questions, for example when an image contains key textual information in Chinese while the question is posed in German. We believe that M4U can serve as a crucial tool for systematically evaluating the multilingual multimodal reasoning capabilities of LMMs and monitoring their development. The homepage, code, and data are publicly available.