The rise of Large Language Models (LLMs) has revolutionized natural language processing across numerous languages and tasks. However, evaluating LLM performance in a consistent and meaningful way across multiple European languages remains challenging, especially due to the scarcity of language-parallel multilingual benchmarks. We introduce a multilingual evaluation approach tailored to European languages, employing translated versions of five widely used benchmarks to assess the capabilities of 40 LLMs across 21 European languages. Our contributions include examining the effectiveness of translated benchmarks, assessing the impact of different translation services, and offering a multilingual evaluation framework for LLMs that includes five newly created datasets: EU20-MMLU, EU20-HellaSwag, EU20-ARC, EU20-TruthfulQA, and EU20-GSM8K. The benchmarks and results are made publicly available to encourage further research in multilingual LLM evaluation.