The rise of Large Language Models (LLMs) has revolutionized natural language processing across numerous languages and tasks. However, evaluating LLM performance in a consistent and meaningful way across multiple European languages remains challenging, especially due to the scarcity of multilingual benchmarks. We introduce a cross-lingual evaluation approach tailored for European languages. We employ translated versions of five widely-used benchmarks to assess the capabilities of 40 LLMs across 21 European languages. Our contributions include examining the effectiveness of translated benchmarks, assessing the impact of different translation services, and offering a multilingual evaluation framework for LLMs that includes newly created datasets: EU20-MMLU, EU20-HellaSwag, EU20-ARC, EU20-TruthfulQA, and EU20-GSM8K. The benchmarks and results are made publicly available to encourage further research in multilingual LLM evaluation.