MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation

Weihao Xuan, Rui Yang, Heli Qi, Qingcheng Zeng, Yunze Xiao, Yun Xing, Junjue Wang, Huitao Li, Xin Li, Kunyu Yu, Nan Liu, Qingyu Chen, Douglas Teodoro, Edison Marrese-Taylor, Shijian Lu, Yusuke Iwasawa, Yutaka Matsuo, Irene Li

Traditional benchmarks struggle to evaluate increasingly sophisticated language models in multilingual and culturally diverse contexts. To address this gap, we introduce MMLU-ProX, a comprehensive multilingual benchmark covering 13 typologically diverse languages with approximately 11,829 questions per language. Building on the challenging reasoning-focused design of MMLU-Pro, our framework employs a semi-automatic translation process: translations generated by state-of-the-art large language models (LLMs) are rigorously evaluated by expert annotators to ensure conceptual accuracy, terminological consistency, and cultural relevance. We comprehensively evaluate 25 state-of-the-art LLMs using 5-shot chain-of-thought (CoT) and zero-shot prompting strategies, analyzing their performance across linguistic and cultural boundaries. Our experiments reveal consistent performance degradation from high-resource languages to lower-resource ones, with the best models achieving over 70% accuracy on English but dropping to around 40% for languages like Swahili, highlighting persistent gaps in multilingual capabilities despite recent advances. MMLU-ProX is an ongoing project; we are expanding our benchmark by incorporating additional languages and evaluating more language models to provide a more comprehensive assessment of multilingual capabilities.
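The cross-lingual comparison described above comes down to computing accuracy per language and measuring how far each language falls below English. The following is a minimal sketch of that bookkeeping, not the authors' evaluation code: the record layout `(language, predicted_choice, gold_choice)`, the function names, the language codes, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def accuracy_by_language(records):
    """Return {language: accuracy} for (language, predicted, gold) records.

    The record layout is a hypothetical one chosen for this sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for language, predicted, gold in records:
        total[language] += 1
        correct[language] += int(predicted == gold)
    return {language: correct[language] / total[language] for language in total}


def gap_to_reference(accuracies, reference="en"):
    """Return the accuracy drop of each language relative to the reference."""
    reference_accuracy = accuracies[reference]
    return {
        language: reference_accuracy - accuracy
        for language, accuracy in accuracies.items()
        if language != reference
    }


# Toy records (illustrative only, not benchmark results).
records = [
    ("en", "A", "A"), ("en", "B", "B"), ("en", "C", "D"), ("en", "D", "D"),
    ("sw", "A", "A"), ("sw", "B", "C"), ("sw", "C", "D"), ("sw", "D", "D"),
]

accuracies = accuracy_by_language(records)
gaps = gap_to_reference(accuracies)
print(accuracies)
print(gaps)
```

Because the questions are parallel across languages, a per-language gap computed this way compares the same items in each language, which is what makes the drop relative to English interpretable.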

