As the capabilities of language models continue to advance, it is conceivable that the "one-size-fits-all" model will remain the main paradigm. For instance, given the vast number of languages worldwide, many of which are low-resource, the prevalent practice is to pretrain a single model on multiple languages. In this paper, we add to the growing body of evidence that challenges this practice, demonstrating that monolingual pretraining on the target language significantly improves models already extensively trained on diverse corpora. More specifically, we further pretrain GPT-J and LLaMA models on Portuguese texts using 3% or less of their original pretraining budget. Few-shot evaluations on Poeta, a suite of 14 Portuguese datasets, reveal that our models outperform English-centric and multilingual counterparts by a significant margin. Our best model, Sabiá-65B, performs on par with GPT-3.5-turbo. By evaluating on datasets originally conceived in the target language as well as translated ones, we study the contributions of language-specific pretraining in terms of 1) capturing linguistic nuances and structures inherent to the target language, and 2) enriching the model's knowledge about a domain or culture. Our results indicate that the majority of the benefits stem from the domain-specific knowledge acquired through monolingual pretraining.