Despite the existence of various benchmarks for evaluating natural language processing models, we argue that human exams are a more suitable means of evaluating general intelligence in large language models (LLMs), as they inherently demand a wider range of abilities, such as language understanding, domain knowledge, and problem-solving skills. To this end, we introduce M3Exam, a novel benchmark sourced from real and official human exam questions for evaluating LLMs in a multilingual, multimodal, and multilevel context. M3Exam exhibits three unique characteristics: (1) multilingualism, encompassing questions from multiple countries that require strong multilingual proficiency and cultural knowledge; (2) multimodality, accounting for the multimodal nature of many exam questions to test a model's multimodal understanding capability; and (3) multilevel structure, featuring exams from three critical educational periods to comprehensively assess a model's proficiency at different levels. In total, M3Exam contains 12,317 questions in 9 diverse languages across three educational levels, and about 23\% of the questions require processing images to be solved. We assess the performance of top-performing LLMs on M3Exam and find that current models, including GPT-4, still struggle with multilingual text, particularly in low-resource and non-Latin script languages. Multimodal LLMs also perform poorly on complex multimodal questions. We believe that M3Exam can be a valuable resource for comprehensively evaluating LLMs by examining their multilingual and multimodal abilities and tracking their development. Data and evaluation code are available at \url{https://github.com/DAMO-NLP-SG/M3Exam}.