Large Language Models (LLMs) continue to advance Natural Language Processing. However, low-resource languages, meaning those that are under-represented in the datasets for various NLP tasks or whose existing datasets are comparatively small, such as Portuguese, benefit from LLMs, but not to the same extent. LLMs trained on multilingual datasets often struggle to respond satisfactorily to prompts in Portuguese, exhibiting, for example, code-switching in their responses. This work proposes Bode, a LLaMA 2-based model fine-tuned for Portuguese prompts, in two versions: 7B and 13B. We evaluate the model on classification tasks using a zero-shot approach with in-context learning and compare it with other LLMs. Our main contributions are an LLM that achieves satisfactory results in Portuguese and a model that is free for research or commercial use.
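To make the zero-shot setting concrete, the following is a minimal sketch of how a classification prompt with no labeled examples might be built and how the model's free-text answer might be mapped back to a label. It is not the evaluation protocol used in this work: the prompt wording, the label set, and the names `build_zero_shot_prompt`, `parse_label`, `classify`, and `fake_generate` are all illustrative assumptions, and `fake_generate` is a stand-in for a real model call.

```python
from typing import Callable, Optional, Sequence


def build_zero_shot_prompt(text: str, labels: Sequence[str]) -> str:
    """Build a Portuguese instruction prompt with no labeled examples (zero-shot).

    The wording is an assumption for illustration, not the paper's prompt.
    """
    options = ", ".join(labels)
    return (
        "Classifique o sentimento do texto a seguir. "
        f"Responda apenas com uma das opções: {options}.\n\n"
        f"Texto: {text}\n"
        "Resposta:"
    )


def parse_label(response: str, labels: Sequence[str]) -> Optional[str]:
    """Return the label that appears earliest in the response, or None."""
    lowered = response.lower()
    hits = [
        (lowered.find(label.lower()), label)
        for label in labels
        if label.lower() in lowered
    ]
    return min(hits)[1] if hits else None


def classify(
    text: str,
    labels: Sequence[str],
    generate: Callable[[str], str],
) -> Optional[str]:
    """Prompt the model once and map its free-text answer to a label."""
    prompt = build_zero_shot_prompt(text, labels)
    return parse_label(generate(prompt), labels)


def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; always answers 'Positivo.'."""
    return " Positivo."


if __name__ == "__main__":
    labels = ["positivo", "negativo"]
    print(classify("Adorei o filme!", labels, fake_generate))
```

In practice `generate` would wrap whichever model is being compared. Returning `None` for unparseable answers keeps responses that drift off-task, for instance by switching languages, from being silently counted as a valid class.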