This report presents EuroLLM-22B, a large language model trained from scratch to support the needs of European citizens by covering all 24 official European Union languages and 11 additional languages. EuroLLM addresses the issue of European languages being underrepresented and underserved in existing open large language models. We provide a comprehensive overview of EuroLLM-22B's development, including tokenizer design, architectural specifications, data filtering, and training procedures. Across a broad set of multilingual benchmarks, EuroLLM-22B demonstrates strong performance in reasoning, instruction following, and translation, achieving results competitive with models of comparable size. To support future research, we release our base and instruction-tuned models, our multilingual web pretraining data and updated EuroBlocks instruction datasets, as well as our pre-training and evaluation codebases.