The performance differential of large language models (LLMs) across languages hinders their effective deployment in many regions, inhibiting the potential economic and societal value of generative AI tools in many communities. However, the development of functional LLMs in many languages (\ie, multilingual LLMs) is bottlenecked by the lack of high-quality evaluation resources in languages other than English. Moreover, current practices in multilingual benchmark construction often translate English resources, ignoring the regional and cultural knowledge of the environments in which multilingual systems would be used. In this work, we construct an evaluation suite of 197,243 QA pairs from local exam sources to measure the capabilities of multilingual LLMs in a variety of regional contexts. Our novel resource, INCLUDE, is a comprehensive knowledge- and reasoning-centric benchmark across 44 written languages that evaluates multilingual LLMs for performance in the actual language environments where they would be deployed.