Large language models (LLMs) have received considerable attention in natural language processing (NLP) research because of their exceptional performance in understanding and generating human language. However, low-resource languages are left behind due to the scarcity of resources. In this work, we focus on enhancing the LLaMA-2-Amharic model by integrating task-specific and generative datasets to improve language model performance for Amharic. We compile an Amharic instruction fine-tuning dataset and fine-tune the LLaMA-2-Amharic model. The fine-tuned model shows promising results on different NLP tasks. We open-source our dataset creation pipeline, instruction datasets, trained models, and evaluation outputs to promote language-specific studies on these models.