As the capabilities of language models continue to advance, it is conceivable that the "one-size-fits-all" model will remain the main paradigm. For instance, given the vast number of languages worldwide, many of which are low-resource, the prevalent practice is to pretrain a single model on multiple languages. In this paper, we add to the growing body of evidence that challenges this practice, demonstrating that monolingual pretraining on the target language significantly improves models already extensively trained on diverse corpora. More specifically, we further pretrain GPT-J and LLaMA models on Portuguese texts using 3% or less of their original pretraining budget. Few-shot evaluations on Poeta, a suite of 14 Portuguese datasets, reveal that our models outperform English-centric and multilingual counterparts by a significant margin. Our best model, Sabiá-65B, performs on par with GPT-3.5-turbo. By evaluating on datasets originally conceived in the target language as well as translated ones, we study the contributions of language-specific pretraining in terms of 1) capturing linguistic nuances and structures inherent to the target language, and 2) enriching the model's knowledge about a domain or culture. Our results indicate that the majority of the benefits stem from the domain-specific knowledge acquired through monolingual pretraining.