Multi-lingual competence in large language models is often evaluated via static benchmarks such as Belebele, M-MMLU and M-GSM. However, these evaluations often fail to provide an adequate understanding of the practical performance and robustness of models across multi-lingual settings. In response, we create multi-lingual functional benchmarks -- Cross-Lingual Grade School Math Symbolic (CL-GSM Symbolic) and Cross-Lingual Instruction-Following Eval (CL-IFEval) -- by translating existing functional benchmark templates from English into five additional languages that span the range of resources available for NLP: French, Spanish, Hindi, Arabic and Yoruba. Our results reveal that some static multi-lingual benchmarks capture functional performance much more closely than others: across models, performance decreases by 24%, 17% and 18% between M-GSM and CL-GSM Symbolic in English, French and Spanish respectively; similarly, performance drops by 15% to 24% across languages between Belebele and CL-IFEval, but by only 0.5% to 3% between M-MMLU and CL-IFEval. We also find that model robustness varies significantly across languages, with certain languages (e.g., Arabic, English) performing most consistently well across evaluation iterations.
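To make the static-versus-functional comparison concrete, the per-language gap can be summarized as a relative performance drop between a static benchmark score and its functional counterpart. The sketch below is a minimal illustration only: the benchmark pairings come from the abstract, but the use of relative (rather than absolute) drops, the scoring helper, and the example accuracies are assumptions for demonstration, not results reported in the paper.

```python
def relative_drop(static_acc: float, functional_acc: float) -> float:
    """Percent decrease from a static benchmark accuracy to the
    accuracy on the corresponding functional benchmark."""
    return 100.0 * (static_acc - functional_acc) / static_acc

# Hypothetical accuracies for one model and one language
# (illustrative values only, not results from the paper):
pairs = {
    "M-GSM vs CL-GSM Symbolic": (0.80, 0.61),
    "Belebele vs CL-IFEval": (0.75, 0.60),
    "M-MMLU vs CL-IFEval": (0.62, 0.60),
}
for name, (static_acc, functional_acc) in pairs.items():
    print(f"{name}: {relative_drop(static_acc, functional_acc):.1f}% drop")
```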