Sabiá: Portuguese Large Language Models

As the capabilities of language models continue to advance, it is conceivable that the "one-size-fits-all" model will remain the main paradigm. For instance, given the vast number of languages worldwide, many of which are low-resource, the prevalent practice is to pretrain a single model on multiple languages. In this paper, we add to the growing body of evidence that challenges this practice, demonstrating that monolingual pretraining on the target language significantly improves models already extensively trained on diverse corpora. More specifically, we further pretrain GPT-J and LLaMA models on Portuguese texts using 3% or less of their original pretraining budget. Few-shot evaluations on Poeta, a suite of 14 Portuguese datasets, reveal that our models outperform English-centric and multilingual counterparts by a significant margin. Our best model, Sabiá-65B, performs on par with GPT-3.5-turbo. By evaluating on datasets originally conceived in the target language as well as translated ones, we study the contributions of language-specific pretraining in terms of 1) capturing linguistic nuances and structures inherent to the target language, and 2) enriching the model's knowledge about a domain or culture. Our results indicate that the majority of the benefits stem from the domain-specific knowledge acquired through monolingual pretraining.

