Traditional benchmarks struggle to evaluate increasingly sophisticated language models in multilingual and culturally diverse contexts. To address this gap, we introduce MMLU-ProX, a comprehensive multilingual benchmark covering 13 typologically diverse languages, with approximately 11,829 questions per language. Building on the challenging, reasoning-focused design of MMLU-Pro, our framework employs a semi-automatic translation process: translations generated by state-of-the-art large language models (LLMs) are rigorously evaluated by expert annotators to ensure conceptual accuracy, terminological consistency, and cultural relevance. We comprehensively evaluate 25 LLMs using 5-shot chain-of-thought (CoT) and zero-shot prompting strategies, analyzing their performance across linguistic and cultural boundaries. Our experiments reveal consistent performance degradation from high-resource languages to lower-resource ones: the best models achieve over 70% accuracy on English but drop to around 40% on languages such as Swahili, highlighting persistent gaps in multilingual capability despite recent advances. MMLU-ProX is an ongoing project; we are expanding the benchmark with additional languages and evaluating more models to provide an even more comprehensive assessment of multilingual capabilities.
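To make the evaluation protocol concrete, here is a minimal sketch of how a 5-shot CoT evaluation could be assembled, assuming an MMLU-Pro-style prompt format in which each worked exemplar ends with "The answer is (X)". The names EXEMPLARS, build_5shot_cot_prompt, and extract_answer are illustrative, not the paper's actual code, and the exemplar data is hypothetical.

```python
import re

# Hypothetical few-shot exemplars: each pairs a question (with lettered
# options) with a worked chain of thought ending in "The answer is (X)".
EXEMPLARS = [
    {
        "question": ("Which planet is known as the Red Planet?\n"
                     "A. Venus\nB. Mars\nC. Jupiter\nD. Saturn"),
        "cot": ("Mars appears red because of iron oxide on its surface. "
                "The answer is (B)."),
    },
    # ... four more exemplars would follow for a true 5-shot prompt ...
]

def build_5shot_cot_prompt(question: str, options: list[str]) -> str:
    """Assemble a few-shot chain-of-thought prompt from the exemplars,
    then append the target question with lettered answer options."""
    parts = [
        f"Question: {ex['question']}\n"
        f"Answer: Let's think step by step. {ex['cot']}"
        for ex in EXEMPLARS
    ]
    lettered = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    parts.append(f"Question: {question}\n{lettered}\n"
                 "Answer: Let's think step by step.")
    return "\n\n".join(parts)

def extract_answer(completion: str) -> str | None:
    """Pull the final lettered choice from a model completion.
    MMLU-Pro questions have up to ten options, hence A-J."""
    match = re.search(r"answer is \(?([A-J])\)?", completion, re.IGNORECASE)
    return match.group(1).upper() if match else None

# Usage: send build_5shot_cot_prompt(q, opts) to a model, then compare
# extract_answer(model_output) against the gold letter, aggregating
# accuracy per language to measure cross-lingual degradation.
```

The zero-shot variant would simply omit the exemplars; per-language accuracies computed this way are what reveal the high- to low-resource performance gap described above.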