Large language models (LLMs) have received considerable attention in natural language processing (NLP) research because of their exceptional performance in understanding and generating human language. However, low-resource languages are left behind due to the scarcity of resources. In this work, we focus on enhancing the LLaMA-2-Amharic model by integrating task-specific and generative datasets to improve language model performance for Amharic. We compile an Amharic instruction fine-tuning dataset and fine-tune the LLaMA-2-Amharic model on it. The fine-tuned model shows promising results across different NLP tasks. We open-source our dataset creation pipeline, instruction datasets, trained models, and evaluation outputs to promote language-specific studies on these models.