Large language models (LLMs) have recently gained popularity due to their strong performance on a wide range of downstream natural language processing (NLP) tasks. However, low-resource languages still lag behind current state-of-the-art (SOTA) developments in NLP because the resources available for training LLMs are insufficient. Ethiopian languages exhibit remarkable linguistic diversity, encompass a wide array of scripts, and carry profound religious and cultural significance. This paper introduces EthioLLM, a set of multilingual large language models for five Ethiopian languages (Amharic, Ge'ez, Afan Oromo, Somali, and Tigrinya) and English, and Ethiobenchmark, a new benchmark dataset for various downstream NLP tasks. We evaluate these models on five downstream NLP tasks and discuss their performance. We open-source our multilingual language models, the new benchmark datasets, and the task-specific fine-tuned language models. Our dataset and models are available at https://huggingface.co/EthioNLP.