Large language models (LLMs) have received considerable attention in natural language processing (NLP) research because of their exceptional performance in understanding and generating human language. However, low-resource languages are left behind due to the unavailability of resources. In this work, we focus on enhancing the LLaMA-2-Amharic model by integrating task-specific and generative datasets to improve language model performance for Amharic. We compile an Amharic instruction fine-tuning dataset and fine-tune the LLaMA-2-Amharic model on it. The fine-tuned model shows promising results across different NLP tasks. We open-source our dataset creation pipeline, instruction datasets, trained models, and evaluation outputs to promote language-specific studies on these models.