Large language models (LLMs) have recently gained popularity due to their strong performance on a variety of downstream Natural Language Processing (NLP) tasks. However, low-resource languages still lag behind current state-of-the-art (SOTA) developments in NLP because the resources needed to train LLMs for them are insufficient. Ethiopian languages exhibit remarkable linguistic diversity, encompass a wide array of scripts, and carry profound religious and cultural significance. This paper introduces EthioLLM -- multilingual large language models for five Ethiopian languages (Amharic, Ge'ez, Afan Oromo, Somali, and Tigrinya) and English, and Ethiobenchmark -- a new benchmark dataset for various downstream NLP tasks. We evaluate these models on five downstream NLP tasks and discuss their performance. We open-source our multilingual language models, the new benchmark datasets, and the task-specific fine-tuned language models. Our datasets and models are available in the EthioNLP repository at https://huggingface.co/EthioNLP.