Large Language Models (LLMs) like GPT-4 and LLaMA have shown remarkable proficiency at natural language processing tasks and have even begun to excel at tasks across other modalities such as vision and audio. Despite their success, LLMs often perform poorly on low-resource languages because so little training data is available, a shortcoming that is especially pronounced in open-source models. In this work, we explore training LLaMA-2 to speak Amharic, a language spoken by over 50 million people worldwide but with orders of magnitude less available data than languages like English. We employ methods previously used for training LLMs on other data-scarce languages, using open-source translation models to perform data augmentation and grow our dataset from millions of tokens to billions. We further enhance the capabilities of our model by connecting an image encoder and training on a translated visual instruction tuning dataset in the same manner as LLaVA, resulting in a multimodal Amharic LLM that can understand images along with text. We also introduce an Amharic version of a popular benchmarking dataset to evaluate our work. Our models and dataset are open-sourced and available on GitHub.
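As a concrete illustration of the translation-based augmentation described above, the sketch below translates English text into Amharic with an open-source translation model. The abstract does not name the specific model or interface used, so the choice of facebook/nllb-200-distilled-600M and the `augment_corpus` helper are illustrative assumptions, not the paper's exact pipeline.

```python
# A minimal sketch of translation-based data augmentation, assuming the
# open-source NLLB model from Hugging Face. The model choice and helper
# below are illustrative; the paper's actual pipeline may differ.
from transformers import pipeline

# NLLB uses FLORES-200 language codes: "eng_Latn" for English,
# "amh_Ethi" for Amharic (Ethiopic script).
translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="amh_Ethi",
)

def augment_corpus(english_texts):
    """Translate an English corpus into Amharic to grow the training set."""
    outputs = translator(english_texts, max_length=512)
    return [o["translation_text"] for o in outputs]

# Each English sentence yields one synthetic Amharic training sample.
amharic_samples = augment_corpus(["The sky is blue.", "He reads a book."])
print(amharic_samples)
```

Applied over a large English corpus, this turns readily available English data into synthetic Amharic training text, which is how a dataset of millions of native tokens can be grown toward billions.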
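The multimodal hookup follows LLaVA: features from a pretrained image encoder are projected into the LLM's token embedding space and prepended to the text embeddings. The PyTorch sketch below shows this connector under assumed dimensions (CLIP ViT-L/14 features at 1024, LLaMA-2-7B embeddings at 4096); the module name `VisionProjector` and the exact shapes are illustrative, not the paper's reported configuration.

```python
# A minimal PyTorch sketch of a LLaVA-style vision-language connector.
# Dimensions and names are assumptions for illustration.
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        # LLaVA-1.0 uses a single linear layer as the connector.
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features):
        # patch_features: (batch, num_patches, vision_dim) from the image encoder
        return self.proj(patch_features)  # (batch, num_patches, llm_dim)

# Projected image "tokens" are concatenated with the embedded text prompt
# before the LLM forward pass.
projector = VisionProjector()
image_feats = torch.randn(1, 576, 1024)  # e.g. CLIP ViT-L/14 @ 336px -> 576 patches
text_embeds = torch.randn(1, 32, 4096)   # embedded Amharic prompt tokens
llm_inputs = torch.cat([projector(image_feats), text_embeds], dim=1)
print(llm_inputs.shape)  # torch.Size([1, 608, 4096])
```

Training the projector (and optionally the LLM) on a translated visual instruction tuning dataset then yields a model that answers Amharic questions about images.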